Members:
For this project, we analyzed data concerning the protection of Internet of Things (IoT) devices such as the Amazon Echo, Google Nest, and Sonos speakers. The dataset comes from a real IoT network with 105 different devices, which the dataset's creators attacked with 33 distinct attacks falling into 7 broad categories. Because the original dataset is very large (3 GB), we started with the 5% sample provided for us and later expanded to a larger part of the dataset.
Since we classify among 8 classes (7 attack categories plus benign traffic), guessing at random yields an expected accuracy of 1/8 = 12.5%. We should aim well above this baseline, since a trained human analyst could plausibly identify all of these attacks correctly.
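To make the baseline concrete, here is a quick sanity check (a sketch; the category counts are taken from the grouped distribution computed later in this notebook). Always predicting the majority class on the raw, imbalanced sample would already score about 73%, which is why accuracy alone can be misleading and why we later rebalance the data:

```python
# Uniform random guessing over k equally likely classes
n_classes = 8
uniform_baseline = 1 / n_classes  # 0.125

# Majority-class baseline on the raw 5% sample
# (category counts as computed later in this notebook)
counts = {"DDoS": 1697247, "DoS": 404800, "Mirai": 131792, "BenignTraffic": 54489,
          "Spoofing": 24260, "Recon": 17640, "Web": 1261, "BruteForce": 661}
majority_baseline = max(counts.values()) / sum(counts.values())  # ~0.73
```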
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from scipy import stats
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
We started by importing the 5% sample of our IoT dataset and visualizing it to better understand the data and the features it contains.
df_5 = pd.read_csv("data/IoT_5Percent.csv")
df_5.head()
| flow_duration | Header_Length | Protocol Type | Duration | Rate | Srate | Drate | fin_flag_number | syn_flag_number | rst_flag_number | ... | Std | Tot size | IAT | Number | Magnitue | Radius | Covariance | Variance | Weight | label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.092170 | 16692.50 | 17.00 | 64.0 | 4124.115328 | 4124.115328 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 50.00 | 8.312373e+07 | 9.5 | 10.000000 | 0.00000 | 0.000000 | 0.00 | 141.55 | DDoS-UDP_Flood |
| 1 | 0.152987 | 17635.00 | 17.00 | 64.0 | 3601.595667 | 3601.595667 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 50.00 | 8.310619e+07 | 9.5 | 10.000000 | 0.00000 | 0.000000 | 0.00 | 141.55 | DDoS-UDP_Flood |
| 2 | 0.000000 | 0.00 | 1.00 | 64.0 | 163.479196 | 163.479196 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 42.00 | 8.312400e+07 | 9.5 | 9.165151 | 0.00000 | 0.000000 | 0.00 | 141.55 | DDoS-ICMP_Flood |
| 3 | 0.000000 | 54.00 | 6.00 | 64.0 | 6.207933 | 6.207933 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 54.00 | 8.333214e+07 | 9.5 | 10.392305 | 0.00000 | 0.000000 | 0.00 | 141.55 | DDoS-PSHACK_Flood |
| 4 | 0.006760 | 199.78 | 16.67 | 64.0 | 37.627478 | 37.627478 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 3.970412 | 177.28 | 8.300370e+07 | 9.5 | 18.997839 | 5.63636 | 153.637263 | 0.11 | 141.55 | DoS-UDP_Flood |
5 rows × 47 columns
df_5.shape
(2332150, 47)
The dataset contains 2,332,150 entries and 47 columns. Now let's show more information about the dataset and the column types.
df_5.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2332150 entries, 0 to 2332149
Data columns (total 47 columns):
 #   Column           Dtype
---  ------           -----
 0   flow_duration    float64
 1   Header_Length    float64
 2   Protocol Type    float64
 3   Duration         float64
 4   Rate             float64
 5   Srate            float64
 6   Drate            float64
 7   fin_flag_number  float64
 8   syn_flag_number  float64
 9   rst_flag_number  float64
 10  psh_flag_number  float64
 11  ack_flag_number  float64
 12  ece_flag_number  float64
 13  cwr_flag_number  float64
 14  ack_count        float64
 15  syn_count        float64
 16  fin_count        float64
 17  urg_count        float64
 18  rst_count        float64
 19  HTTP             float64
 20  HTTPS            float64
 21  DNS              float64
 22  Telnet           float64
 23  SMTP             float64
 24  SSH              float64
 25  IRC              float64
 26  TCP              float64
 27  UDP              float64
 28  DHCP             float64
 29  ARP              float64
 30  ICMP             float64
 31  IPv              float64
 32  LLC              float64
 33  Tot sum          float64
 34  Min              float64
 35  Max              float64
 36  AVG              float64
 37  Std              float64
 38  Tot size         float64
 39  IAT              float64
 40  Number           float64
 41  Magnitue         float64
 42  Radius           float64
 43  Covariance       float64
 44  Variance         float64
 45  Weight           float64
 46  label            object
dtypes: float64(46), object(1)
memory usage: 836.3+ MB
There are 47 columns in total: 46 numerical features of type float64 and 1 class label of type object. Since the data contains no categorical features (only numerical ones), we don't need to encode anything in this preprocessing step. Below we check for missing values, and we are happy to see that there are none!
df_5.isna().sum().sum()
0
Then we searched for columns that contain only binary (0/1) values.
bool_cols = [col for col in df_5
if np.isin(df_5[col].dropna().unique(), [0, 1]).all()]
bool_cols
['fin_flag_number', 'syn_flag_number', 'rst_flag_number', 'psh_flag_number', 'ack_flag_number', 'ece_flag_number', 'cwr_flag_number', 'HTTP', 'HTTPS', 'DNS', 'Telnet', 'SMTP', 'SSH', 'IRC', 'TCP', 'UDP', 'DHCP', 'ARP', 'ICMP', 'IPv', 'LLC']
To see other important information about the dataset, we ran summary statistics:
df_5.describe()
| flow_duration | Header_Length | Protocol Type | Duration | Rate | Srate | Drate | fin_flag_number | syn_flag_number | rst_flag_number | ... | AVG | Std | Tot size | IAT | Number | Magnitue | Radius | Covariance | Variance | Weight | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.332150e+06 | 2.332150e+06 | 2.332150e+06 | 2.332150e+06 | 2.332150e+06 | 2.332150e+06 | 2.332150e+06 | 2.332150e+06 | 2.332150e+06 | 2.332150e+06 | ... | 2.332150e+06 | 2.332150e+06 | 2.332150e+06 | 2.332150e+06 | 2.332150e+06 | 2.332150e+06 | 2.332150e+06 | 2.332150e+06 | 2.332150e+06 | 2.332150e+06 |
| mean | 5.757926e+00 | 7.714112e+04 | 9.072652e+00 | 6.634822e+01 | 9.091876e+03 | 9.091876e+03 | 3.048512e-06 | 8.665952e-02 | 2.077379e-01 | 9.049761e-02 | ... | 1.247483e+02 | 3.329294e+01 | 1.247587e+02 | 8.316881e+07 | 9.497866e+00 | 1.312542e+01 | 4.705157e+01 | 3.055702e+04 | 9.630161e-02 | 1.414963e+02 |
| std | 2.857280e+02 | 4.637787e+05 | 8.956664e+00 | 1.403039e+01 | 9.990555e+04 | 9.990555e+04 | 2.209338e-03 | 2.813355e-01 | 4.056882e-01 | 2.868934e-01 | ... | 2.408751e+02 | 1.599584e+02 | 2.414911e+02 | 1.700245e+07 | 8.170173e-01 | 8.631892e+00 | 2.262434e+02 | 3.116256e+05 | 2.326818e-01 | 2.101296e+01 |
| min | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 4.200000e+01 | 0.000000e+00 | 4.200000e+01 | 0.000000e+00 | 1.000000e+00 | 9.165151e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 |
| 25% | 0.000000e+00 | 5.400000e+01 | 6.000000e+00 | 6.400000e+01 | 2.091515e+00 | 2.091515e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 5.000000e+01 | 0.000000e+00 | 5.000000e+01 | 8.307156e+07 | 9.500000e+00 | 1.000000e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.415500e+02 |
| 50% | 0.000000e+00 | 5.400000e+01 | 6.000000e+00 | 6.400000e+01 | 1.579892e+01 | 1.579892e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 5.400000e+01 | 0.000000e+00 | 5.400000e+01 | 8.312452e+07 | 9.500000e+00 | 1.039230e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.415500e+02 |
| 75% | 1.051811e-01 | 2.726300e+02 | 1.436000e+01 | 6.400000e+01 | 1.173987e+02 | 1.173987e+02 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 5.405011e+01 | 3.719096e-01 | 5.406000e+01 | 8.334391e+07 | 9.500000e+00 | 1.039673e+01 | 5.059213e-01 | 1.344216e+00 | 8.000000e-02 | 1.415500e+02 |
| max | 1.314562e+05 | 9.840879e+06 | 4.700000e+01 | 2.550000e+02 | 8.388608e+06 | 8.388608e+06 | 2.989952e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | ... | 8.609545e+03 | 8.493570e+03 | 9.622800e+03 | 1.676394e+08 | 1.500000e+01 | 1.284702e+02 | 1.203616e+04 | 7.290803e+07 | 1.000000e+00 | 2.446000e+02 |
8 rows × 46 columns
The summary statistics already tell us quite a lot about the features. However, to visualize the data better and get a more in-depth idea of what we will be working with, we analyze it with plots and graphs, as shown later.
The target we need to predict is the "label" column. We noticed that many labels correspond to variants of the same type of attack, so we grouped them into more general categories, hoping that our models will work better with fewer labels (as shown in this paper: https://www.mdpi.com/1424-8220/23/13/5941#B99-sensors-23-05941). Now, let's plot the distribution of these labels and create a new "category_label" column containing the group of each value.
df_5.groupby("label").size().sort_values(ascending=False)
label
DDoS-ICMP_Flood            358566
DDoS-UDP_Flood             270120
DDoS-TCP_Flood             224376
DDoS-PSHACK_Flood          204981
DDoS-SYN_Flood             203200
DDoS-RSTFINFlood           202274
DDoS-SynonymousIP_Flood    179873
DoS-UDP_Flood              166013
DoS-TCP_Flood              133466
DoS-SYN_Flood              101658
BenignTraffic               54489
Mirai-greeth_flood          49684
Mirai-udpplain              44356
Mirai-greip_flood           37752
DDoS-ICMP_Fragmentation     22633
MITM-ArpSpoofing            15414
DDoS-UDP_Fragmentation      14330
DDoS-ACK_Fragmentation      14300
DNS_Spoofing                 8846
Recon-HostDiscovery          6531
Recon-OSScan                 4983
Recon-PortScan               4110
DoS-HTTP_Flood               3663
VulnerabilityScan            1906
DDoS-HTTP_Flood              1443
DDoS-SlowLoris               1151
DictionaryBruteForce          661
BrowserHijacking              303
CommandInjection              286
SqlInjection                  273
XSS                           172
Backdoor_Malware              161
Recon-PingSweep               110
Uploading_Attack               66
dtype: int64
df_5.label.value_counts().plot(kind='bar')
plt.show()
We can see that the dataset is heavily imbalanced. Let's now combine the attacks into categories and plot the distribution again.
dict_8classes = {
    'DDoS-RSTFINFlood': 'DDoS',
    'DDoS-PSHACK_Flood': 'DDoS',
    'DDoS-SYN_Flood': 'DDoS',
    'DDoS-UDP_Flood': 'DDoS',
    'DDoS-TCP_Flood': 'DDoS',
    'DDoS-ICMP_Flood': 'DDoS',
    'DDoS-SynonymousIP_Flood': 'DDoS',
    'DDoS-ACK_Fragmentation': 'DDoS',
    'DDoS-UDP_Fragmentation': 'DDoS',
    'DDoS-ICMP_Fragmentation': 'DDoS',
    'DDoS-SlowLoris': 'DDoS',
    'DDoS-HTTP_Flood': 'DDoS',
    'DoS-UDP_Flood': 'DoS',
    'DoS-SYN_Flood': 'DoS',
    'DoS-TCP_Flood': 'DoS',
    'DoS-HTTP_Flood': 'DoS',
    'Mirai-greeth_flood': 'Mirai',
    'Mirai-greip_flood': 'Mirai',
    'Mirai-udpplain': 'Mirai',
    'Recon-PingSweep': 'Recon',
    'Recon-OSScan': 'Recon',
    'Recon-PortScan': 'Recon',
    'VulnerabilityScan': 'Recon',
    'Recon-HostDiscovery': 'Recon',
    'DNS_Spoofing': 'Spoofing',
    'MITM-ArpSpoofing': 'Spoofing',
    'BrowserHijacking': 'Web',
    'Backdoor_Malware': 'Web',
    'XSS': 'Web',
    'Uploading_Attack': 'Web',
    'SqlInjection': 'Web',
    'CommandInjection': 'Web',
    'DictionaryBruteForce': 'BruteForce',
    'BenignTraffic': 'BenignTraffic',
}
df_5["category_label"] = df_5["label"].map(dict_8classes)
df_5.groupby("category_label").size().sort_values(ascending=False)
category_label
DDoS             1697247
DoS               404800
Mirai             131792
BenignTraffic      54489
Spoofing           24260
Recon              17640
Web                 1261
BruteForce           661
dtype: int64
Now to visualize our regrouped attacks:
df_5.category_label.value_counts().plot(kind='bar')
<Axes: >
The classes are extremely imbalanced. Looking ahead, we will need a much more balanced dataset. Since the full original dataset is available, and the best way to deal with imbalanced data is to collect more data (rather than undersample or oversample), we decided to build a different dataset with 20k entries for each category (where available). This should give us a more balanced dataset made of real entries.
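The per-category cap can be sketched with a pandas `groupby` sample (a simplified sketch that assumes the `category_label` column already exists; `cap_per_category` is our own illustrative helper, while the actual extraction from the full CSVs follows below):

```python
import pandas as pd

def cap_per_category(frame, n=20000, seed=42):
    """Keep at most n randomly sampled rows per category_label (real entries only)."""
    return (frame.groupby("category_label", group_keys=False)
                 .apply(lambda g: g.sample(n=min(len(g), n), random_state=seed)))
```

Categories with fewer than `n` rows (such as BruteForce) simply keep everything they have.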
The full dataset is obtained from this Kaggle repository: https://www.kaggle.com/datasets/madhavmalhotra/unb-cic-iot-dataset/data
# Uncomment this to run -> then run the cell below to read the file directly
# THIS WAS RUN ON GOOGLE COLAB TO AVOID DOWNLOADING EVERYTHING ON A PERSONAL MACHINE; DIRECTORIES ARE DIFFERENT
"""
!rm -r ~/.kaggle
!mkdir ~/.kaggle
!mv ./kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download madhavmalhotra/unb-cic-iot-dataset
!unzip -q /content/unb-cic-iot-dataset.zip -d .
import os
import pandas as pd
dict_8classes = {
    'DDoS-RSTFINFlood': 'DDoS',
    'DDoS-PSHACK_Flood': 'DDoS',
    'DDoS-SYN_Flood': 'DDoS',
    'DDoS-UDP_Flood': 'DDoS',
    'DDoS-TCP_Flood': 'DDoS',
    'DDoS-ICMP_Flood': 'DDoS',
    'DDoS-SynonymousIP_Flood': 'DDoS',
    'DDoS-ACK_Fragmentation': 'DDoS',
    'DDoS-UDP_Fragmentation': 'DDoS',
    'DDoS-ICMP_Fragmentation': 'DDoS',
    'DDoS-SlowLoris': 'DDoS',
    'DDoS-HTTP_Flood': 'DDoS',
    'DoS-UDP_Flood': 'DoS',
    'DoS-SYN_Flood': 'DoS',
    'DoS-TCP_Flood': 'DoS',
    'DoS-HTTP_Flood': 'DoS',
    'Mirai-greeth_flood': 'Mirai',
    'Mirai-greip_flood': 'Mirai',
    'Mirai-udpplain': 'Mirai',
    'Recon-PingSweep': 'Recon',
    'Recon-OSScan': 'Recon',
    'Recon-PortScan': 'Recon',
    'VulnerabilityScan': 'Recon',
    'Recon-HostDiscovery': 'Recon',
    'DNS_Spoofing': 'Spoofing',
    'MITM-ArpSpoofing': 'Spoofing',
    'BrowserHijacking': 'Web',
    'Backdoor_Malware': 'Web',
    'XSS': 'Web',
    'Uploading_Attack': 'Web',
    'SqlInjection': 'Web',
    'CommandInjection': 'Web',
    'DictionaryBruteForce': 'BruteForce',
    'BenignTraffic': 'BenignTraffic',
}
df = pd.DataFrame()
for category in ['DDoS', 'DoS', 'Mirai', 'Recon', 'Spoofing', 'Web', 'BruteForce', 'BenignTraffic']:
    print('starting:', category)
    df_cat = pd.DataFrame()
    for i in os.listdir('/content/wataiData/csv/CICIoT2023'):
        df_temp = pd.read_csv('/content/wataiData/csv/CICIoT2023/' + i)
        # map each fine-grained label to its category
        # (fixed: the original referenced a non-existent dict_7classes here)
        df_temp['category_label'] = df_temp['label'].map(dict_8classes)
        df_temp = df_temp.query("category_label == @category").copy()
        df_cat = pd.concat([df_cat, df_temp])
        if len(df_cat) >= 20000:
            df = pd.concat([df, df_cat.head(20000)])
            break
    else:
        # only reached when a category (BruteForce) never hits 20000 entries:
        # keep all the entries it does have
        df = pd.concat([df, df_cat])
print(df.category_label.value_counts())
print(df.shape)
df.to_csv('/content/full_balanced_data.csv', index=None)
"""
df = pd.read_csv("data/full_balanced_data.csv")
df.head()
| flow_duration | Header_Length | Protocol Type | Duration | Rate | Srate | Drate | fin_flag_number | syn_flag_number | rst_flag_number | ... | Tot size | IAT | Number | Magnitue | Radius | Covariance | Variance | Weight | label | category_label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.391010 | 86.40 | 6.0 | 64.00 | 0.353898 | 0.353898 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 54.00 | 8.333083e+07 | 9.5 | 10.392305 | 0.000000 | 0.000000 | 0.00 | 141.55 | DDoS-PSHACK_Flood | DDoS |
| 1 | 0.000000 | 0.00 | 1.0 | 64.00 | 0.995891 | 0.995891 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 42.00 | 8.313207e+07 | 9.5 | 9.165151 | 0.000000 | 0.000000 | 0.00 | 141.55 | DDoS-ICMP_Flood | DDoS |
| 2 | 0.000000 | 54.69 | 6.0 | 63.78 | 29852.697509 | 29852.697509 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 54.69 | 8.306737e+07 | 9.5 | 10.402806 | 0.659309 | 7.245867 | 0.03 | 141.55 | DDoS-TCP_Flood | DDoS |
| 3 | 0.891138 | 91.80 | 6.0 | 64.00 | 2.940833 | 2.940833 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 54.00 | 8.336111e+07 | 9.5 | 10.392305 | 0.000000 | 0.000000 | 0.00 | 141.55 | DDoS-SynonymousIP_Flood | DDoS |
| 4 | 0.000000 | 0.00 | 1.0 | 64.00 | 2.011545 | 2.011545 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 42.00 | 8.312779e+07 | 9.5 | 9.165151 | 0.000000 | 0.000000 | 0.00 | 141.55 | DDoS-ICMP_Flood | DDoS |
5 rows × 48 columns
df.groupby("category_label").size().sort_values(ascending=False)
category_label
BenignTraffic    20000
DDoS             20000
DoS              20000
Mirai            20000
Recon            20000
Spoofing         20000
Web              20000
BruteForce       13064
dtype: int64
df.category_label.value_counts().plot(kind='bar')
<Axes: >
Looking at the plot and the number of entries per attack category, we can see that the dataset is now much more balanced and ready for exploratory data analysis. Note that the BruteForce category contains fewer entries because the original dataset simply does not contain as many BruteForce entries as the other categories.
We can now create a big test dataset that we will use to evaluate the classifiers that we will train.
from sklearn.model_selection import train_test_split
df_DDoS = df[df['category_label'] == 'DDoS'].copy()
df_DoS = df[df['category_label'] == 'DoS'].copy()
df_Mirai = df[df['category_label'] == 'Mirai'].copy()
df_Recon = df[df['category_label'] == 'Recon'].copy()
df_Spoofing = df[df['category_label'] == 'Spoofing'].copy()
df_BenignTraffic = df[df['category_label'] == 'BenignTraffic'].copy()
df_Web = df[df['category_label'] == 'Web'].copy()
df_BruteForce = df[df['category_label'] == 'BruteForce'].copy()
dfs = [df_DDoS, df_DoS, df_Mirai, df_Recon, df_Spoofing, df_BenignTraffic, df_Web, df_BruteForce]
dfs_use = []
dfs_final_test = []
for df_cat in dfs:  # avoid reusing the name df, which we need below
    df_train, df_test = train_test_split(df_cat, train_size=10000, random_state=42)
    dfs_use.append(df_train)
    dfs_final_test.append(df_test)
df = pd.concat(dfs_use)
df_test = pd.concat(dfs_final_test)
df.category_label.value_counts().plot(kind='bar')
<Axes: >
df_test.category_label.value_counts().plot(kind='bar')
<Axes: >
df_test.to_csv("data/final_test_data.csv", index=None)
We will not touch this final test data until the end, when we use it to evaluate the final classifiers. The test dataset contains 10k entries for each category, except for BruteForce, which has fewer.
We begin by importing the necessary libraries:
import seaborn as sns
from ipywidgets import interact, widgets
In this section, we want to see whether any of our features contain significant outliers or data that might be unnecessary for our analysis and could skew our results. Note that all of these analyses are performed per label: in this specific case (attacks), the outliers might be genuine, and by working on grouped labels we avoid removing data points that are naturally different from one another.
columns = df.drop(['label', 'category_label'], axis=1).columns
With the following helper functions, we can create interactive plots that make it easier to analyze the data without plotting too many distributions at once.
def create_boxplot(df):
def plot(column):
plt.figure(figsize=(6, 4))
sorted_categories = sorted(df['category_label'].unique())
sns.boxplot(x='category_label', y=column, data=df, order=sorted_categories)
plt.title(f'Boxplot for {column}')
plt.xticks(rotation=45)
plt.show()
columns = df.drop(['label', 'category_label'], axis=1).columns
dropdown_menu = widgets.Dropdown(options=columns, description='Select Column')
interact(plot, column=dropdown_menu)
def create_hist(df):
def plot(column):
plt.figure(figsize=(12, 16))
columns_per_row = 2
unique_categories = sorted(df['category_label'].unique())
num_categories = len(unique_categories)
num_rows = (num_categories + columns_per_row - 1) // columns_per_row
overall_min = df[column].min()
overall_max = df[column].max()
for i, category in enumerate(unique_categories, start=1):
plt.subplot(num_rows, columns_per_row, i)
bins = 50
bin_edges = [overall_min + (overall_max - overall_min) * j / bins for j in range(bins + 1)]
sns.histplot(df[df['category_label'] == category][column], bins=bin_edges, kde=True)
plt.title(f'Histogram for {column} - {category}')
plt.xlabel(column)
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
columns = df.drop(['label', 'category_label'], axis=1).columns
dropdown_menu = widgets.Dropdown(options=columns, description='Select Column')
interact(plot, column=dropdown_menu)
create_boxplot(df)
interactive(children=(Dropdown(description='Select Column', options=('flow_duration', 'Header_Length', 'Protoc…
As part of the visualization of the data, we can determine the minimum and maximum values, median, and the quartiles for each label.
create_hist(df)
interactive(children=(Dropdown(description='Select Column', options=('flow_duration', 'Header_Length', 'Protoc…
Above we created histograms to best visualize the distribution of the data based on the different labels.
To decide which features to drop, we also created a correlation matrix to see which features are correlated with each other. Two highly correlated features carry largely redundant information for the prediction, so one of them can be dropped.
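To illustrate this redundancy on synthetic data (a toy example, not the IoT features): a column that is a near-linear copy of another shows a correlation close to 1 and would be flagged by a 0.95 threshold, while an independent column is kept:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
toy = pd.DataFrame({
    "a": x,                                          # base feature
    "b": 2 * x + rng.normal(scale=0.01, size=1000),  # near-duplicate of "a"
    "c": rng.normal(size=1000),                      # independent feature
})
corr = toy.corr().abs()
# corr.loc["a", "b"] is ~1.0 (drop "b"); corr.loc["a", "c"] stays small (keep "c")
```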
plt.figure(figsize=(25,25))
sns.heatmap(df[columns].corr(), cbar=True, cmap='Blues')
plt.show()
Let's now split the dataframe into training and testing data, so that we can remove outliers/features from the training data alone.
The test data created here will be used later to evaluate models while trying different parameters (faster than testing on the big test set).
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train.category_label.value_counts().plot(kind='bar')
<Axes: >
df_test.category_label.value_counts().plot(kind='bar')
<Axes: >
The training and testing data contain roughly 8000 and 2000 entries per category, respectively.
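Note that `train_test_split` without `stratify` keeps the per-class counts only approximately equal; passing `stratify` preserves the class proportions exactly. A sketch on toy data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"x": range(100), "category_label": ["A"] * 80 + ["B"] * 20})
tr, te = train_test_split(toy, test_size=0.2,
                          stratify=toy["category_label"], random_state=42)
# the 80/20 class ratio is preserved exactly in both splits
```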
To remove outliers in a label-aware way, we work with one dataframe per category label. We removed only a few rows, mainly from columns that would otherwise contain mostly a single value; those constant columns are then dropped in a later step.
df_DDoS = df_train[df_train['category_label'] == 'DDoS'].copy()
df_DoS = df_train[df_train['category_label'] == 'DoS'].copy()
df_Mirai = df_train[df_train['category_label'] == 'Mirai'].copy()
df_Recon = df_train[df_train['category_label'] == 'Recon'].copy()
df_Spoofing = df_train[df_train['category_label'] == 'Spoofing'].copy()
df_BenignTraffic = df_train[df_train['category_label'] == 'BenignTraffic'].copy()
df_Web = df_train[df_train['category_label'] == 'Web'].copy()
df_BruteForce = df_train[df_train['category_label'] == 'BruteForce'].copy()
dfs = [df_DDoS, df_DoS, df_Mirai, df_Recon, df_Spoofing, df_BenignTraffic, df_Web, df_BruteForce]
df_noOut = pd.DataFrame()
for df_small in dfs:
    # widened IQR-style rule: 5th/95th percentiles instead of the usual quartiles,
    # so only extreme values within each category are removed
    Q1 = df_small.quantile(0.05, numeric_only=True)
    Q3 = df_small.quantile(0.95, numeric_only=True)
    IQR = Q3 - Q1
    num_cols = IQR.index  # restrict the comparison to numeric columns
    outliers = ((df_small[num_cols] < (Q1 - 1.5 * IQR)) | (df_small[num_cols] > (Q3 + 1.5 * IQR))).any(axis=1)
    df_small_noOut = df_small[~outliers]
    df_noOut = pd.concat([df_noOut, df_small_noOut])
df_noOut
| flow_duration | Header_Length | Protocol Type | Duration | Rate | Srate | Drate | fin_flag_number | syn_flag_number | rst_flag_number | ... | Tot size | IAT | Number | Magnitue | Radius | Covariance | Variance | Weight | label | category_label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9216 | 0.179236 | 34353.0 | 17.0 | 64.0 | 5381.270971 | 5381.270971 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 50.0 | 8.309874e+07 | 9.5 | 10.000000 | 0.000000 | 0.000000 | 0.0 | 141.55 | DDoS-UDP_Flood | DDoS |
| 8745 | 0.000000 | 54.0 | 6.0 | 64.0 | 3.500237 | 3.500237 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 54.0 | 8.307247e+07 | 9.5 | 10.392305 | 0.000000 | 0.000000 | 0.0 | 141.55 | DDoS-TCP_Flood | DDoS |
| 767 | 0.000000 | 54.0 | 6.0 | 64.0 | 1.964445 | 1.964445 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 54.0 | 8.334505e+07 | 9.5 | 10.392305 | 0.000000 | 0.000000 | 0.0 | 141.55 | DDoS-RSTFINFlood | DDoS |
| 4240 | 0.056592 | 24416.0 | 17.0 | 64.0 | 8687.939069 | 8687.939069 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 50.0 | 8.309854e+07 | 9.5 | 10.000000 | 0.000000 | 0.000000 | 0.0 | 141.55 | DDoS-UDP_Flood | DDoS |
| 1553 | 0.141661 | 41525.0 | 17.0 | 64.0 | 5861.688641 | 5861.688641 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 50.0 | 8.310643e+07 | 9.5 | 10.000000 | 0.000000 | 0.000000 | 0.0 | 141.55 | DDoS-UDP_Flood | DDoS |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 146738 | 29.741062 | 2474.7 | 5.4 | 89.8 | 0.843541 | 0.843541 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 75.8 | 1.666014e+08 | 13.5 | 20.869937 | 328.053795 | 53972.480847 | 1.0 | 244.60 | DictionaryBruteForce | BruteForce |
| 148186 | 32.902408 | 4817.0 | 5.9 | 70.3 | 2.185764 | 2.185764 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 268.2 | 1.666013e+08 | 13.5 | 19.926096 | 259.399693 | 35165.118743 | 1.0 | 244.60 | DictionaryBruteForce | BruteForce |
| 142819 | 151.224092 | 7710.7 | 8.1 | 100.9 | 8.198359 | 8.198359 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 79.5 | 1.666013e+08 | 13.5 | 13.066891 | 66.793175 | 2241.983920 | 1.0 | 244.60 | DictionaryBruteForce | BruteForce |
| 147482 | 52.893145 | 61552.7 | 6.0 | 60.6 | 64.318027 | 64.318027 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1138.0 | 1.666014e+08 | 13.5 | 47.619821 | 1349.119123 | 914883.417976 | 1.0 | 244.60 | DictionaryBruteForce | BruteForce |
| 150398 | 225.584879 | 51052.9 | 9.3 | 80.5 | 15.716158 | 15.716158 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 85.6 | 4.000211e-03 | 5.5 | 13.742173 | 38.597231 | 876.179947 | 0.9 | 38.50 | DictionaryBruteForce | BruteForce |
55749 rows × 48 columns
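The per-label filtering above can also be written more compactly with `groupby` (a sketch; `drop_outliers_per_group` is our own helper name, applying the same widened 5th/95th-percentile rule within each group):

```python
import pandas as pd

def drop_outliers_per_group(frame, group_col="category_label"):
    """Remove rows that are extreme within their own group (5th/95th percentile rule)."""
    def _filter(g):
        num = g.select_dtypes("number")
        q_lo, q_hi = num.quantile(0.05), num.quantile(0.95)
        spread = q_hi - q_lo
        keep = ~((num < q_lo - 1.5 * spread) | (num > q_hi + 1.5 * spread)).any(axis=1)
        return g[keep]
    return frame.groupby(group_col, group_keys=False).apply(_filter)
```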
df_noOut.category_label.value_counts().plot(kind="bar")
<Axes: >
nunique = df_noOut.nunique()
cols_to_drop = nunique[nunique == 1].index
df_noOut.drop(cols_to_drop, axis=1, inplace = True)
#df_noOut.sum()
df_noOut.shape
(55749, 37)
create_boxplot(df_noOut)
interactive(children=(Dropdown(description='Select Column', options=('flow_duration', 'Header_Length', 'Protoc…
create_hist(df_noOut)
interactive(children=(Dropdown(description='Select Column', options=('flow_duration', 'Header_Length', 'Protoc…
Now let's see how the dataset and the distributions have changed. Looking at the Duration feature, for example, the boxplots show fewer outliers than in the original dataset.
df_noOut.groupby("category_label").size().sort_values(ascending=False)
category_label
Spoofing         7541
DDoS             7032
BenignTraffic    7018
Mirai            7005
Web              6991
BruteForce       6872
Recon            6839
DoS              6451
dtype: int64
We have just dropped the columns whose values were all identical; now we look for columns with a correlation higher than 95%.
corr_matrix = df_noOut.corr(numeric_only=True)
plt.figure(figsize=(25,25))
sns.heatmap(corr_matrix, annot= True, cbar=True, cmap='Blues')
plt.show()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
df_noOut.drop(to_drop, axis=1, inplace=True)
The columns with a correlation higher than 95% have now been dropped. We can now visualize the updated correlation matrix.
corr_matrix = df_noOut.corr(numeric_only=True)
plt.figure(figsize=(25,25))
sns.heatmap(corr_matrix, annot= True, cbar=True, cmap='Blues')
plt.show()
df_noOut.shape
(55749, 32)
create_boxplot(df_noOut)
interactive(children=(Dropdown(description='Select Column', options=('flow_duration', 'Header_Length', 'Protoc…
create_hist(df_noOut)
interactive(children=(Dropdown(description='Select Column', options=('flow_duration', 'Header_Length', 'Protoc…
df_noOut.to_csv("data/df_train.csv", index=None)
df_test[df_noOut.columns].to_csv("data/df_test.csv", index=None)
We now have df_train, df_test, df_test_final.
In this section of the project, we train linear classifiers and evaluate them.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from joblib import dump, load
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
    ConfusionMatrixDisplay
)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC, SVC
import time
In our analysis, we saw a slight advantage in using the MinMaxScaler (mainly time-wise), so that is the one we use.
df_train = pd.read_csv("data/df_train.csv")
df_test = pd.read_csv("data/df_test.csv")
print(df_train.shape)
df_test.shape
(55749, 32)
(16000, 32)
df_train.columns
Index(['flow_duration', 'Header_Length', 'Protocol Type', 'Duration', 'Rate',
'fin_flag_number', 'syn_flag_number', 'rst_flag_number',
'psh_flag_number', 'ack_flag_number', 'ack_count', 'syn_count',
'fin_count', 'urg_count', 'rst_count', 'HTTP', 'HTTPS', 'SSH', 'TCP',
'UDP', 'ICMP', 'Tot sum', 'Min', 'Max', 'AVG', 'Std', 'Tot size', 'IAT',
'Covariance', 'Variance', 'label', 'category_label'],
dtype='object')
X_train = df_train.drop(columns=["label", "category_label"])
X_test = df_test.drop(columns=["label", "category_label"])
y_train = df_train["category_label"]
y_test = df_test["category_label"]
Now we scale the values in the train and test sets (all columns in this case, because there are only numerical features) to improve performance and predictions.
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Let's also save the fitted MinMax scaler for later use.
dump(scaler, "scalers/MinMax_scaler.bin", compress=True)
['scalers/MinMax_scaler.bin']
X_train
array([[6.23926880e-05, 3.51799317e-03, 3.48251141e-01, ...,
4.95699232e-01, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 5.52998665e-06, 1.09276559e-01, ...,
4.95542567e-01, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 5.52998665e-06, 1.09276559e-01, ...,
4.97168543e-01, 0.00000000e+00, 0.00000000e+00],
...,
[5.26417584e-02, 7.89630890e-04, 1.54898979e-01, ...,
9.93807472e-01, 3.51855034e-04, 1.00000000e+00],
[1.84123318e-02, 6.30343721e-03, 1.09276559e-01, ...,
9.93807866e-01, 1.43581019e-01, 1.00000000e+00],
[7.85270688e-02, 5.22818251e-03, 1.80968933e-01, ...,
2.38578678e-11, 1.37506930e-04, 9.00000000e-01]])
X_test
array([[6.71990368e-04, 4.35761002e-01, 1.09276559e-01, ...,
9.93822377e-01, 2.77462212e-01, 1.00000000e+00],
[2.15150703e-03, 2.72177750e-04, 3.48251141e-01, ...,
3.81115471e-11, 4.00248523e-04, 9.00000000e-01],
[7.44866824e-05, 4.80391988e-05, 1.09276559e-01, ...,
8.85326117e-13, 3.23289405e-05, 9.00000000e-01],
...,
[6.43533149e-02, 4.53481435e-03, 1.22311536e-01, ...,
9.93870220e-01, 3.66891909e-04, 1.00000000e+00],
[3.28807419e-02, 6.47223493e-03, 3.00456224e-01, ...,
8.11236054e-11, 5.70437350e-04, 9.00000000e-01],
[0.00000000e+00, 5.52998665e-06, 1.09276559e-01, ...,
4.97162423e-01, 0.00000000e+00, 0.00000000e+00]])
To ensure consistency, we converted the labels from strings to numerical values.
label2id = {'BenignTraffic': 0, 'Mirai': 1, 'DDoS': 2, 'Spoofing': 3, 'Web': 4, 'Recon': 5, 'BruteForce': 6, 'DoS': 7}
id2label = {0: 'BenignTraffic', 1: 'Mirai', 2: 'DDoS', 3: 'Spoofing', 4: 'Web', 5: 'Recon', 6: 'BruteForce', 7: 'DoS'}
y_train = [label2id[label] for label in y_train]
y_test = [label2id[label] for label in y_test]
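For reference, scikit-learn's `LabelEncoder` does the same string-to-id conversion automatically (a sketch; note it assigns ids in alphabetical order, so its ids differ from our custom `label2id` mapping and the two schemes are not interchangeable):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
ids = le.fit_transform(["DDoS", "BenignTraffic", "DDoS", "Web"])
# le.classes_ is sorted alphabetically: BenignTraffic -> 0, DDoS -> 1, Web -> 2
back = le.inverse_transform(ids)  # recovers the original strings
```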
df_test_final = pd.read_csv("data/final_test_data.csv")
df_test_final = df_test_final[df_train.columns]
X_test_final = df_test_final.drop(columns=["label", "category_label"])
y_test_final = df_test_final["category_label"]
X_test_final = scaler.transform(X_test_final)
y_test_final = [label2id[label] for label in y_test_final]
X_test_final
array([[0.00000000e+00, 5.52998665e-06, 1.09276559e-01, ...,
4.97189731e-01, 0.00000000e+00, 0.00000000e+00],
[2.11623367e-05, 3.54072756e-03, 3.48251141e-01, ...,
4.95748301e-01, 0.00000000e+00, 0.00000000e+00],
[1.62244423e-05, 1.59187117e-03, 3.48251141e-01, ...,
4.95693809e-01, 1.35708219e-06, 5.00000000e-02],
...,
[5.96485960e-02, 4.05759698e-03, 2.04866391e-01, ...,
9.93659510e-01, 2.30069310e-02, 1.00000000e+00],
[1.29483675e-05, 1.89453246e-05, 1.35346513e-01, ...,
9.93807282e-01, 3.50791102e-04, 1.00000000e+00],
[1.85072518e-02, 8.11929025e-03, 1.17966544e-01, ...,
9.93807611e-01, 1.38643743e-04, 1.00000000e+00]])
df_train.columns
Index(['flow_duration', 'Header_Length', 'Protocol Type', 'Duration', 'Rate',
'fin_flag_number', 'syn_flag_number', 'rst_flag_number',
'psh_flag_number', 'ack_flag_number', 'ack_count', 'syn_count',
'fin_count', 'urg_count', 'rst_count', 'HTTP', 'HTTPS', 'SSH', 'TCP',
'UDP', 'ICMP', 'Tot sum', 'Min', 'Max', 'AVG', 'Std', 'Tot size', 'IAT',
'Covariance', 'Variance', 'label', 'category_label'],
dtype='object')
df_test_final.category_label.value_counts().plot(kind='bar')
<Axes: >
Here we import our preprocessed dataset to try linear models and see which of them predicts best.
We are going to work with the following linear classifiers: Logistic Regression, Gaussian Naive Bayes, and Linear SVM.
This helper function plots the different metrics for each classifier.
def plot_metric(eval_dict, model_name, param_name, param_values, log_scale=False):
fig, ax = plt.subplots(3, 2, figsize=(12, 8), sharex=True)
metrics = {
'Accuracy': 'list_accuracy',
'Precision': 'list_precision',
'Recall': 'list_recall',
'Specificity': 'list_specificity',
'F1': 'list_f1',
'Training time': 'list_training_time'
}
colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:olive', 'tab:cyan', 'tab:red']
for i, (metric_name, metric_key) in enumerate(metrics.items()):
x_value = list(range(0, len(eval_dict[metric_key])))
ax[i // 2, i % 2].plot(x_value, eval_dict[metric_key], c=colors[i])
if log_scale:
ax[i // 2, i % 2].set_xscale('log')
# Add xticks at the position of the maximum value
max_idx = np.argmax(eval_dict[metric_key])
max_value = eval_dict[metric_key][max_idx]
ax[i // 2, i % 2].axvline(x=x_value[max_idx], color='lightgray', linestyle='--')
ax[i // 2, i % 2].axhline(y=max_value, color='lightgray', linestyle='--')
ax[i // 2, i % 2].scatter(x_value[max_idx], max_value, color='red', marker='o')
ax[i // 2, i % 2].set_xticks(ticks=x_value, labels=param_values, rotation=90)
ax[i // 2, i % 2].set_title(f'{metric_name} ({model_name})')
ax[i // 2, i % 2].set_xlabel(param_name)
# Add y-axis label (tick) for the maximum value
ax[i // 2, i % 2].set_ylabel(metric_name)
yticks = list(ax[i // 2, i % 2].get_yticks())
del yticks[-2:]
yticks.append(max_value)
ax[i // 2, i % 2].set_yticks(yticks)
if metric_name != 'Training time':
ax[i // 2, i % 2].text(max_idx, eval_dict[metric_key][np.argmin(eval_dict[metric_key])], f'{param_values[max_idx]}', c='red')
plt.tight_layout()
plt.show()
Before running the model, we must create a list of parameters for Logistic Regression and a list for each metric we want to evaluate. In this case, we have chosen to vary the C parameter (the inverse regularization strength) to look at the behaviour of Logistic Regression.
list_param_LR = np.linspace(0.1, 1.0, 19)
list_param_LR
array([0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55, 0.6 ,
0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ])
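To illustrate why C matters: in scikit-learn, C is the *inverse* of the regularization strength, so a smaller C shrinks the coefficients more aggressively. A minimal sketch on synthetic data (`make_classification` is used here purely for illustration, not the project dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Smaller C -> stronger L2 penalty -> smaller coefficient norm
norms = {}
for C in (0.01, 100.0):
    lr = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    norms[C] = np.linalg.norm(lr.coef_)

assert norms[0.01] < norms[100.0]
```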
eval_dict = {'list_accuracy': [],
'list_precision': [],
'list_recall': [],
'list_specificity': [],
'list_f1': [],
'list_training_time': []}
# Initialize variables to keep track of the best model
best_accuracy = 0.0
for val in list_param_LR:
# Create classifier instance
model = LogisticRegression(max_iter=5000, tol=1e-3, C=val)
# Register start time
t_start = time.time()
# Fit the classifier on the training data
model.fit(X_train, y_train)
# Register end time
t_stop = time.time()
# Predict probabilities on the test data
y_pred = model.predict(X_test)
# Calculate accuracy
current_accuracy = accuracy_score(y_test, y_pred)
# Save the model if it has the best accuracy so far
if current_accuracy > best_accuracy:
best_accuracy = current_accuracy
dump(model, 'models/linear/logistic_best.joblib')
eval_dict['list_accuracy'].append(current_accuracy)
eval_dict['list_precision'].append(precision_score(y_test, y_pred, average='weighted'))
eval_dict['list_recall'].append(recall_score(y_test, y_pred, average='weighted'))
    eval_dict['list_specificity'].append(recall_score(y_test, y_pred, pos_label=0, average='weighted'))  # NB: pos_label is ignored with average='weighted', so this equals the weighted recall
eval_dict['list_f1'].append(f1_score(y_test, y_pred, average='weighted'))
eval_dict['list_training_time'].append(t_stop - t_start)
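A caveat on the specificity entry above: with `average='weighted'`, scikit-learn ignores `pos_label`, so that line effectively records the weighted recall a second time. A hedged sketch of a true per-class specificity computed from the confusion matrix (the helper name is ours, not part of the notebook):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def macro_specificity(y_true, y_pred):
    """Average per-class specificity: TN / (TN + FP) for each class."""
    cm = confusion_matrix(y_true, y_pred)
    specs = []
    for k in range(cm.shape[0]):
        tp = cm[k, k]
        fp = cm[:, k].sum() - tp   # predicted as class k but actually another class
        fn = cm[k, :].sum() - tp
        tn = cm.sum() - tp - fp - fn
        specs.append(tn / (tn + fp))
    return float(np.mean(specs))

# Tiny example: two classes, one false positive for class 1
print(macro_specificity([0, 0, 1, 1], [0, 1, 1, 1]))  # → 0.75
```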
Now we plot the performance as a function of the hyperparameter C, the inverse of the regularization strength: smaller C means stronger regularization.
plot_metric(eval_dict, 'Logistic Regression', 'C', list_param_LR)
As we can see, Logistic Regression trains very quickly, but its accuracy and the other metrics are not particularly high, which should be kept in mind when selecting the best model.
Now we run it on the bigger test set.
clf = load('models/linear/logistic_best.joblib')
y_pred = clf.predict(X_test_final)
print(classification_report(y_test_final, y_pred,target_names=label2id))
precision recall f1-score support
BenignTraffic 0.65 0.68 0.67 10000
Mirai 0.99 0.97 0.98 10000
DDoS 0.80 0.57 0.66 10000
Spoofing 0.65 0.58 0.62 10000
Web 0.53 0.53 0.53 10000
Recon 0.65 0.43 0.52 10000
BruteForce 0.21 0.51 0.30 3064
DoS 0.64 0.82 0.72 10000
accuracy 0.65 73064
macro avg 0.64 0.64 0.62 73064
weighted avg 0.68 0.65 0.65 73064
After rerunning the model on the bigger test set, we can see that the difference in accuracy is almost negligible (0.638 vs. 0.65), indicating that the model is not overfitting.
ConfusionMatrixDisplay.from_predictions(y_test_final, y_pred, display_labels=label2id, cmap='Blues', normalize='true', xticks_rotation=45)
plt.show()
We can see that the model performs especially well when classifying the Mirai attack because 97% of the Mirai attacks have been correctly classified. On the other hand, the model performs the worst when classifying the Recon attack, with only about 43% of them being classified correctly.
Gaussian Naive Bayes has a parameter, var_smoothing, that controls the smoothing of the variances. We reused the parameter grid from the linear-classifiers class exercise.
list_param_NB = np.logspace(0, -9, 19)
list_param_NB
array([1.00000000e+00, 3.16227766e-01, 1.00000000e-01, 3.16227766e-02,
1.00000000e-02, 3.16227766e-03, 1.00000000e-03, 3.16227766e-04,
1.00000000e-04, 3.16227766e-05, 1.00000000e-05, 3.16227766e-06,
1.00000000e-06, 3.16227766e-07, 1.00000000e-07, 3.16227766e-08,
1.00000000e-08, 3.16227766e-09, 1.00000000e-09])
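Internally, GaussianNB adds var_smoothing times the largest feature variance to every per-class variance (exposed after fitting as the `epsilon_` attribute), which stabilizes the Gaussian likelihoods when a feature has near-zero variance. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 200.0], [2.0, 210.0], [3.0, 190.0], [4.0, 205.0]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB(var_smoothing=1e-3).fit(X, y)

# epsilon_ = var_smoothing * (largest feature variance), added to each class variance
expected = 1e-3 * X.var(axis=0).max()
assert np.isclose(clf.epsilon_, expected)
```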
eval_dict = {'list_accuracy': [],
'list_precision': [],
'list_recall': [],
'list_specificity': [],
'list_f1': [],
'list_training_time': []}
# Initialize variables to keep track of the best model
best_accuracy = 0.0
for val in list_param_NB:
# Create classifier instance
model = GaussianNB(var_smoothing=val)
# Register start time
t_start = time.time()
# Fit the classifier on the training data
model.fit(X_train, y_train)
# Register end time
t_stop = time.time()
# Predict probabilities on the test data
y_pred = model.predict(X_test)
# Calculate accuracy
current_accuracy = accuracy_score(y_test, y_pred)
# Save the model if it has the best accuracy so far
if current_accuracy > best_accuracy:
best_accuracy = current_accuracy
dump(model, 'models/linear/NB_best.joblib')
eval_dict['list_accuracy'].append(current_accuracy)
eval_dict['list_precision'].append(precision_score(y_test, y_pred, average='weighted'))
eval_dict['list_recall'].append(recall_score(y_test, y_pred, average='weighted'))
    eval_dict['list_specificity'].append(recall_score(y_test, y_pred, pos_label=0, average='weighted'))  # NB: pos_label is ignored with average='weighted', so this equals the weighted recall
eval_dict['list_f1'].append(f1_score(y_test, y_pred, average='weighted'))
eval_dict['list_training_time'].append(t_stop - t_start)
plot_metric(eval_dict, 'Naive Bayes', 'var_smoothing', list_param_NB)
clf = load('models/linear/NB_best.joblib')
y_pred = clf.predict(X_test_final)
print(classification_report(y_test_final, y_pred,target_names=label2id))
precision recall f1-score support
BenignTraffic 0.49 0.72 0.58 10000
Mirai 1.00 0.94 0.97 10000
DDoS 0.85 0.49 0.62 10000
Spoofing 0.54 0.30 0.39 10000
Web 0.41 0.51 0.46 10000
Recon 0.74 0.38 0.50 10000
BruteForce 0.19 0.41 0.26 3064
DoS 0.63 0.85 0.72 10000
accuracy 0.59 73064
macro avg 0.61 0.58 0.56 73064
weighted avg 0.65 0.59 0.59 73064
After rerunning the model on the bigger test set, we can see that the accuracy difference is once again almost negligible, indicating that the model is not overfitting.
ConfusionMatrixDisplay.from_predictions(y_test_final, y_pred, display_labels=label2id, cmap='Blues', normalize='true', xticks_rotation=45)
plt.show()
For Linear SVM, the hyperparameter chosen is again C, the regularization parameter: larger C means less regularization and a narrower margin.
list_param_SVM = np.linspace(0.1, 2.0, 20)
list_param_SVM
array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3,
1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. ])
eval_dict = {'list_accuracy': [],
'list_precision': [],
'list_recall': [],
'list_specificity': [],
'list_f1': [],
'list_training_time': []}
# Initialize variables to keep track of the best model
best_accuracy = 0.0
for val in list_param_SVM:
model = LinearSVC(C=val, max_iter=10000)
# Register start time
t_start = time.time()
# Fit the classifier on the training data
model.fit(X_train, y_train)
# Register end time
t_stop = time.time()
# Predict probabilities on the test data
y_pred = model.predict(X_test)
# Calculate accuracy
current_accuracy = accuracy_score(y_test, y_pred)
# Save the model if it has the best accuracy so far
if current_accuracy > best_accuracy:
best_accuracy = current_accuracy
dump(model, 'models/linear/linearSVC_best.joblib')
eval_dict['list_accuracy'].append(current_accuracy)
eval_dict['list_precision'].append(precision_score(y_test, y_pred, average='weighted'))
eval_dict['list_recall'].append(recall_score(y_test, y_pred, average='weighted'))
    eval_dict['list_specificity'].append(recall_score(y_test, y_pred, pos_label=0, average='weighted'))  # NB: pos_label is ignored with average='weighted', so this equals the weighted recall
eval_dict['list_f1'].append(f1_score(y_test, y_pred, average='weighted'))
eval_dict['list_training_time'].append(t_stop - t_start)
plot_metric(eval_dict, 'Linear SVM', 'C', list_param_SVM)
Above we can see that as C increases (i.e., the margin narrows), the metrics improve.
clf = load('models/linear/linearSVC_best.joblib')
y_pred = clf.predict(X_test_final)
print(classification_report(y_test_final, y_pred,target_names=label2id))
precision recall f1-score support
BenignTraffic 0.65 0.72 0.68 10000
Mirai 0.98 0.97 0.98 10000
DDoS 0.79 0.59 0.67 10000
Spoofing 0.66 0.56 0.61 10000
Web 0.54 0.56 0.55 10000
Recon 0.64 0.42 0.51 10000
BruteForce 0.22 0.48 0.30 3064
DoS 0.64 0.81 0.72 10000
accuracy 0.65 73064
macro avg 0.64 0.64 0.63 73064
weighted avg 0.68 0.65 0.66 73064
ConfusionMatrixDisplay.from_predictions(y_test_final, y_pred, display_labels=label2id, cmap='Blues', normalize='true', xticks_rotation=45)
plt.show()
Here too, we are happy to see that the metrics remain consistent between the small and the big test sets. The model is quite good at classifying Mirai, but not so great at the other classes (with the exception of DoS).
In this section of the project, we train non-linear classifiers and evaluate them using the same metrics described above.
Some of the following code repeats earlier cells; originally we worked in multiple notebooks and kept that structure so that single sections can be run independently if needed.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from joblib import dump, load
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
import time
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
f1_score,
confusion_matrix,
classification_report,
ConfusionMatrixDisplay,
precision_recall_curve,
average_precision_score,
roc_curve,
roc_auc_score
)
from sklearn.calibration import CalibrationDisplay
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_validate
df_train = pd.read_csv("data/df_train.csv")
df_test = pd.read_csv("data/df_test.csv")
print(df_train.shape)
df_test.shape
(106461, 37)
(30613, 37)
df_train.columns
Index(['flow_duration', 'Header_Length', 'Protocol Type', 'Duration', 'Rate',
'Srate', 'fin_flag_number', 'syn_flag_number', 'rst_flag_number',
'psh_flag_number', 'ack_flag_number', 'ack_count', 'syn_count',
'fin_count', 'urg_count', 'rst_count', 'HTTP', 'HTTPS', 'SSH', 'TCP',
'UDP', 'ICMP', 'Tot sum', 'Min', 'Max', 'AVG', 'Std', 'Tot size', 'IAT',
'Number', 'Magnitue', 'Radius', 'Covariance', 'Variance', 'Weight',
'label', 'category_label'],
dtype='object')
X_train = df_train.drop(columns=["label", "category_label"])
X_test = df_test.drop(columns=["label", "category_label"])
y_train = df_train["category_label"]
y_test = df_test["category_label"]
Now we scale the values in the train and test sets (all columns in this case, because there are only numerical features) to obtain better performance and predictions.
standard_scaler = StandardScaler().fit(X_train)
X_train = standard_scaler.transform(X_train)
X_test = standard_scaler.transform(X_test)
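As a quick sanity check, the scaled training matrix should have (approximately) zero mean and unit variance per column, while the test set may deviate slightly because it is transformed with the training-set statistics only. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
X_te = rng.normal(loc=5.0, scale=3.0, size=(200, 4))

scaler = StandardScaler().fit(X_tr)   # statistics come from the training set only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

assert np.allclose(X_tr_s.mean(axis=0), 0.0, atol=1e-9)
assert np.allclose(X_tr_s.std(axis=0), 1.0)
# Test columns are close to, but not exactly, standardized
assert np.allclose(X_te_s.mean(axis=0), 0.0, atol=0.3)
```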
Let's also save the standard scaler for future use.
#dump(standard_scaler, "scalers/standard_scaler.bin", compress=True)
X_train
array([[-0.35783634, -0.38945825, -0.96776639, ..., -0.34611033,
-1.22351325, -0.00148732],
[-0.35783634, -0.38941128, -0.52971559, ..., -0.34611033,
-1.22351325, -0.00148732],
[-0.35756658, -0.38939812, -0.52971559, ..., -0.34611033,
-1.22351325, -0.00148732],
...,
[-0.24805674, -0.38403351, -0.43334441, ..., -0.34371997,
1.03459445, 1.28693162],
[ 0.84589391, -0.29711208, -0.14423088, ..., -0.34169314,
0.35716214, -1.28990626],
[-0.13634996, -0.38450125, -0.24060206, ..., -0.26270529,
1.03459445, 1.28693162]])
X_test
array([[-0.35779965, -0.38222422, 0.41910245, ..., -0.34610864,
-0.95254032, -0.00148732],
[-0.21566835, -0.38648426, -0.28440714, ..., 0.16990613,
1.03459445, 1.28693162],
[-0.28462578, 0.15191701, -0.63484778, ..., 0.24150108,
1.03459445, 1.28693162],
...,
[-0.35613823, -0.3893892 , -0.52971559, ..., -0.34611033,
-1.22351325, -0.00148732],
[-0.31089163, -0.38812514, 0.00470639, ..., -0.34443892,
0.80878368, -1.28990626],
[ 0.63260148, -0.38384772, -0.27564612, ..., -0.32699794,
0.92168906, -0.00148732]])
label2id = {'BenignTraffic': 0, 'Mirai': 1, 'DDoS': 2, 'Spoofing': 3, 'Web': 4, 'Recon': 5, 'BruteForce': 6, 'DoS': 7}
id2label = {0: 'BenignTraffic', 1: 'Mirai', 2: 'DDoS', 3: 'Spoofing', 4: 'Web', 5: 'Recon', 6: 'BruteForce', 7: 'DoS'}
y_train = [label2id[label] for label in y_train]
y_test = [label2id[label] for label in y_test]
Let's also prepare the final test set that we will use with each best classifier.
df_test_final = pd.read_csv("data/final_test_data.csv")
df_test_final = df_test_final[df_train.columns]
X_test_final = df_test_final.drop(columns=["label", "category_label"])
y_test_final = df_test_final["category_label"]
X_test_final = standard_scaler.transform(X_test_final)
y_test_final = [label2id[label] for label in y_test_final]
X_test_final
array([[-0.35783634, -0.38941128, -0.52971559, ..., -0.34611033,
-1.22351325, -0.00148732],
[-0.35754794, -0.35938168, 0.43399618, ..., -0.34611033,
-1.22351325, -0.00148732],
[-0.35761523, -0.37593616, 0.43399618, ..., -0.34609245,
-1.11060786, -0.00148732],
...,
[ 0.45506423, -0.35499115, -0.14423088, ..., -0.04296205,
1.03459445, 1.28693162],
[-0.35765988, -0.38929732, -0.42458339, ..., -0.34148817,
1.03459445, 1.28693162],
[-0.10561657, -0.32048926, -0.49467152, ..., -0.3442835 ,
1.03459445, 1.28693162]])
df_train.columns
Index(['flow_duration', 'Header_Length', 'Protocol Type', 'Duration', 'Rate',
'Srate', 'fin_flag_number', 'syn_flag_number', 'rst_flag_number',
'psh_flag_number', 'ack_flag_number', 'ack_count', 'syn_count',
'fin_count', 'urg_count', 'rst_count', 'HTTP', 'HTTPS', 'SSH', 'TCP',
'UDP', 'ICMP', 'Tot sum', 'Min', 'Max', 'AVG', 'Std', 'Tot size', 'IAT',
'Number', 'Magnitue', 'Radius', 'Covariance', 'Variance', 'Weight',
'label', 'category_label'],
dtype='object')
df_test_final.category_label.value_counts().plot(kind='bar')
<Axes: >
def plot_metric(eval_dict, model_name, param_name, param_values, log_scale=False):
fig, ax = plt.subplots(3, 2, figsize=(12, 8), sharex=True)
metrics = {
'Accuracy': 'list_accuracy',
'Precision': 'list_precision',
'Recall': 'list_recall',
'Specificity': 'list_specificity',
'F1': 'list_f1',
'Training time': 'list_training_time'
}
colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:olive', 'tab:cyan', 'tab:red']
for i, (metric_name, metric_key) in enumerate(metrics.items()):
x_value = list(range(0, len(eval_dict[metric_key])))
ax[i // 2, i % 2].plot(x_value, eval_dict[metric_key], c=colors[i])
if log_scale:
ax[i // 2, i % 2].set_xscale('log')
# Add xticks at the position of the maximum value
max_idx = np.argmax(eval_dict[metric_key])
max_value = eval_dict[metric_key][max_idx]
ax[i // 2, i % 2].axvline(x=x_value[max_idx], color='lightgray', linestyle='--')
ax[i // 2, i % 2].axhline(y=max_value, color='lightgray', linestyle='--')
ax[i // 2, i % 2].scatter(x_value[max_idx], max_value, color='red', marker='o')
ax[i // 2, i % 2].set_xticks(ticks=x_value, labels=param_values, rotation=90)
ax[i // 2, i % 2].set_title(f'{metric_name} ({model_name})')
ax[i // 2, i % 2].set_xlabel(param_name)
# Add y-axis label (tick) for the maximum value
ax[i // 2, i % 2].set_ylabel(metric_name)
yticks = list(ax[i // 2, i % 2].get_yticks())
del yticks[-2:]
yticks.append(max_value)
ax[i // 2, i % 2].set_yticks(yticks)
if metric_name != 'Training time':
ax[i // 2, i % 2].text(max_idx, eval_dict[metric_key][np.argmin(eval_dict[metric_key])], f'{param_values[max_idx]}', c='red')
plt.tight_layout()
plt.show()
Now let's create the list of parameters for k-Nearest Neighbors and a list for each metric, which we will evaluate for each parameter value. We chose to try from 1 up to 50 neighbors: since the dataset is big, we can afford more neighbors than the default 5 while keeping good performance. Then we plot and analyze the metrics.
list_param_k_nearest_neighbors = range(1,50)
eval_dict = {'list_accuracy': [],
'list_precision': [],
'list_recall': [],
'list_specificity': [],
'list_f1': [],
'list_training_time': []}
# Initialize variables to keep track of the best model
best_accuracy = 0.0
for val in list_param_k_nearest_neighbors:
# Create classifier instance
    clf = KNeighborsClassifier(n_neighbors=val)
# Register start time
t_start = time.time()
# Fit the classifier on the training data
clf.fit(X_train, y_train)
# Register end time
t_stop = time.time()
# Predict probabilities on the test data
y_pred = clf.predict(X_test)
# Calculate accuracy
current_accuracy = accuracy_score(y_test, y_pred)
# Save the model if it has the best accuracy so far
if current_accuracy > best_accuracy:
best_accuracy = current_accuracy
dump(clf, 'models/nonLinear/knn_best.joblib')
eval_dict['list_accuracy'].append(current_accuracy)
eval_dict['list_precision'].append(precision_score(y_test, y_pred, average='weighted'))
eval_dict['list_recall'].append(recall_score(y_test, y_pred, average='weighted'))
    eval_dict['list_specificity'].append(recall_score(y_test, y_pred, pos_label=0, average='weighted'))  # NB: pos_label is ignored with average='weighted', so this equals the weighted recall
eval_dict['list_f1'].append(f1_score(y_test, y_pred, average='weighted'))
eval_dict['list_training_time'].append(t_stop - t_start)
plot_metric(eval_dict, 'KNN', '# of neighbors', list_param_k_nearest_neighbors)
As we can see from the metrics, the scores grow as the number of neighbors increases from 1 up to about 11 (21 for precision) and decrease beyond that: averaging over too many neighbors can underfit, because points from other classes start to dominate the vote. kNN also had by far the best training times: fitting essentially just stores the training points in a search structure, and the real work of computing distances to the k nearest neighbors happens at prediction time.
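To make the point concrete, the sketch below times fit versus predict for k-NN on synthetic data (timings are machine-dependent, so treat the numbers as indicative only):

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 20))
y = rng.integers(0, 2, size=20000)

clf = KNeighborsClassifier(n_neighbors=11)

t0 = time.time()
clf.fit(X, y)                  # just builds a tree over the stored points
fit_time = time.time() - t0

t0 = time.time()
clf.predict(X[:2000])          # distance queries dominate the cost
predict_time = time.time() - t0

print(f"fit: {fit_time:.3f}s, predict: {predict_time:.3f}s")
```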
Let's now load the best model (k = 11) and analyze its confusion matrix.
clf = load('models/nonLinear/knn_best.joblib')
y_pred = clf.predict(X_test_final)
print(classification_report(y_test_final, y_pred,target_names=label2id))
precision recall f1-score support
BenignTraffic 0.63 0.72 0.67 10000
Mirai 1.00 0.97 0.99 10000
DDoS 0.84 0.86 0.85 10000
Spoofing 0.70 0.58 0.63 10000
Web 0.57 0.63 0.60 10000
Recon 0.78 0.52 0.63 10000
BruteForce 0.30 0.54 0.38 3064
DoS 0.80 0.82 0.81 10000
accuracy 0.72 73064
macro avg 0.70 0.71 0.70 73064
weighted avg 0.74 0.72 0.73 73064
ConfusionMatrixDisplay.from_predictions(y_test_final, y_pred, display_labels=label2id, cmap='Blues', normalize='true', xticks_rotation=45)
plt.show()
We can already see that it performs better, and we are happy to see that DDoS and DoS are now better distinguished from each other (before, many DDoS flows were classified as DoS).
Now we will try a Decision Tree. Let's create the list of parameters for the Decision Tree classifier and a list for each metric, which we will evaluate for each parameter value. In this case, we vary max_depth from 3 to 200 in steps of 5 to see how much the classifier's performance improves as the allowed depth increases. Then we plot and analyze the metrics.
list_param_decision_tree = range(3, 200, 5)
eval_dict = {'list_accuracy': [],
'list_precision': [],
'list_recall': [],
'list_specificity': [],
'list_f1': [],
'list_training_time': []}
# Initialize variables to keep track of the best model
best_accuracy = 0.0
for val in list_param_decision_tree:
# Create classifier instance
clf = DecisionTreeClassifier(max_depth=val, random_state=777)
# Register start time
t_start = time.time()
# Fit the classifier on the training data
clf.fit(X_train, y_train)
# Register end time
t_stop = time.time()
# Predict probabilities on the test data
y_pred = clf.predict(X_test)
# Calculate accuracy
current_accuracy = accuracy_score(y_test, y_pred)
# Save the model if it has the best accuracy so far
if current_accuracy > best_accuracy:
best_accuracy = current_accuracy
dump(clf, 'models/nonLinear/DT_best.joblib')
eval_dict['list_accuracy'].append(current_accuracy)
eval_dict['list_precision'].append(precision_score(y_test, y_pred, average='weighted'))
eval_dict['list_recall'].append(recall_score(y_test, y_pred, average='weighted'))
    eval_dict['list_specificity'].append(recall_score(y_test, y_pred, pos_label=0, average='weighted'))  # NB: pos_label is ignored with average='weighted', so this equals the weighted recall
eval_dict['list_f1'].append(f1_score(y_test, y_pred, average='weighted'))
eval_dict['list_training_time'].append(t_stop - t_start)
plot_metric(eval_dict, 'Decision Tree', 'Max Depth', list_param_decision_tree)
As we can see from the plots, the metrics converge from a max_depth of around 23-25. Another important consideration is time: the larger the max_depth we allow, the higher the peaks in training time, but the tree is still much faster to train than most of the other classifiers, since decision trees scale well to large datasets. Beyond the time performance, we also obtained higher accuracy, precision, recall, specificity and F1 than with the other classifiers: decision trees generally work very well on multi-class tabular datasets like this one.
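One reason the curves converge: once max_depth exceeds the depth the fully grown tree actually needs, raising the cap changes nothing, and `get_depth()` reveals the effective depth reached. A minimal sketch on synthetic data (`make_classification` used purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=777)

depths = []
for max_depth in (3, 50, 200):
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=777).fit(X, y)
    depths.append(clf.get_depth())  # depth actually reached, not the cap

print(depths)  # effective depth stops growing once the cap is no longer binding
```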
Let's now load the best model (max_depth = 23) and analyze its confusion matrix.
clf = load('models/nonLinear/DT_best.joblib')
y_pred = clf.predict(X_test_final)
print(classification_report(y_test_final, y_pred, target_names=label2id))
precision recall f1-score support
BenignTraffic 0.86 0.85 0.86 10000
Mirai 1.00 0.98 0.99 10000
DDoS 1.00 0.96 0.98 10000
Spoofing 0.84 0.81 0.83 10000
Web 0.76 0.86 0.80 10000
Recon 0.86 0.77 0.81 10000
BruteForce 0.59 0.77 0.67 3064
DoS 1.00 0.98 0.99 10000
accuracy 0.88 73064
macro avg 0.86 0.87 0.86 73064
weighted avg 0.89 0.88 0.88 73064
ConfusionMatrixDisplay.from_predictions(y_test_final, y_pred, display_labels=label2id, cmap='Blues', normalize='true', xticks_rotation=45)
plt.show()
As seen above, the label-based performance is significantly better than the other models.
Now we will try Kernel Support Vector Machines. Let's create the list of parameters for the kernel SVM classifier and a list for each metric, which we will evaluate for each parameter value. In this case, we again vary the C parameter (regularization parameter) to look at the behaviour of the kernel SVM. Then we plot and analyze the metrics.
list_param_kernel_support_vector_machines = [0.1, 0.5, 1.0, 2.5, 5.0]
eval_dict = {'list_accuracy': [],
'list_precision': [],
'list_recall': [],
'list_specificity': [],
'list_f1': [],
'list_training_time': []}
# Initialize variables to keep track of the best model
best_accuracy = 0.0
for val in list_param_kernel_support_vector_machines:
# Create classifier instance
clf = SVC(C=val, random_state=777)
# Register start time
t_start = time.time()
# Fit the classifier on the training data
clf.fit(X_train, y_train)
# Register end time
t_stop = time.time()
# Predict probabilities on the test data
y_pred = clf.predict(X_test)
# Calculate accuracy
current_accuracy = accuracy_score(y_test, y_pred)
# Save the model if it has the best accuracy so far
if current_accuracy > best_accuracy:
best_accuracy = current_accuracy
dump(clf, 'models/nonLinear/kernel_SVM_best.joblib')
eval_dict['list_accuracy'].append(current_accuracy)
eval_dict['list_precision'].append(precision_score(y_test, y_pred, average='weighted'))
eval_dict['list_recall'].append(recall_score(y_test, y_pred, average='weighted'))
    eval_dict['list_specificity'].append(recall_score(y_test, y_pred, pos_label=0, average='weighted'))  # NB: pos_label is ignored with average='weighted', so this equals the weighted recall
eval_dict['list_f1'].append(f1_score(y_test, y_pred, average='weighted'))
eval_dict['list_training_time'].append(t_stop - t_start)
plot_metric(eval_dict, 'Kernel Support Vector Machines', 'C', list_param_kernel_support_vector_machines)
As we can see, the kernel SVM is a very slow non-linear classifier on this dataset: we are working with a relatively large training set, and the training complexity of kernelized SVMs grows roughly quadratically with the number of training samples. The general trend remains the same: as C increases, performance increases.
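The scaling can be observed directly by timing SVC fits on nested subsets: doubling the sample count should multiply the fit time by noticeably more than two. A rough sketch on synthetic data (exact timings vary by machine):

```python
import time
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 20))
y = (X[:, 0] + rng.normal(scale=0.5, size=4000) > 0).astype(int)

times = []
for n in (1000, 2000, 4000):
    t0 = time.time()
    SVC(C=1.0).fit(X[:n], y[:n])   # RBF kernel by default, as in the loop above
    times.append(time.time() - t0)
    print(f"n={n}: {times[-1]:.2f}s")
```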
Let's now load the best model and analyze its confusion matrix.
clf = load('models/nonLinear/kernel_SVM_best.joblib')
y_pred = clf.predict(X_test_final)
Note that prediction alone takes quite a while here (about 4 minutes).
print(classification_report(y_test_final, y_pred,target_names=label2id))
precision recall f1-score support
BenignTraffic 0.73 0.74 0.74 10000
Mirai 1.00 0.97 0.98 10000
DDoS 0.76 0.72 0.74 10000
Spoofing 0.66 0.63 0.65 10000
Web 0.59 0.72 0.65 10000
Recon 0.85 0.51 0.64 10000
BruteForce 0.30 0.60 0.40 3064
DoS 0.76 0.75 0.75 10000
accuracy 0.72 73064
macro avg 0.71 0.71 0.69 73064
weighted avg 0.74 0.72 0.72 73064
ConfusionMatrixDisplay.from_predictions(y_test_final, y_pred, display_labels=label2id, cmap='Blues', normalize='true', xticks_rotation=45)
plt.show()
Now we will try Neural Networks. In this case, the parameter under consideration is the hidden-layer architecture (the number and size of the hidden layers). We chose to keep a low number of neurons in each layer to maintain time efficiency, also considering the size and complexity of our dataset. Then we plot and analyze the metrics.
list_param_neural_networks = [(10,), (50,), (50, 10), (50, 50), (100, 50, 10)]  # trailing commas make the single-layer entries proper tuples
eval_dict = {'list_accuracy': [],
'list_precision': [],
'list_recall': [],
'list_specificity': [],
'list_f1': [],
'list_training_time': []}
In this case, the default number of iterations (200) was not sufficient to make the optimization converge, so we needed to increase it.
# Initialize variables to keep track of the best model
best_accuracy = 0.0
for val in list_param_neural_networks:
# Create classifier instance
clf = MLPClassifier(hidden_layer_sizes=val, max_iter=2000, random_state=777, early_stopping=True, n_iter_no_change=100)
# Register start time
t_start = time.time()
# Fit the classifier on the training data
clf.fit(X_train, y_train)
# Register end time
t_stop = time.time()
# Predict probabilities on the test data
y_pred = clf.predict(X_test)
# Calculate accuracy
current_accuracy = accuracy_score(y_test, y_pred)
# Save the model if it has the best accuracy so far
if current_accuracy > best_accuracy:
best_accuracy = current_accuracy
dump(clf, 'models/nonLinear/NN_best.joblib')
eval_dict['list_accuracy'].append(current_accuracy)
eval_dict['list_precision'].append(precision_score(y_test, y_pred, average='weighted'))
eval_dict['list_recall'].append(recall_score(y_test, y_pred, average='weighted'))
    eval_dict['list_specificity'].append(recall_score(y_test, y_pred, pos_label=0, average='weighted'))  # NB: pos_label is ignored with average='weighted', so this equals the weighted recall
eval_dict['list_f1'].append(f1_score(y_test, y_pred, average='weighted'))
eval_dict['list_training_time'].append(t_stop - t_start)
plot_metric(eval_dict, 'MLP Classifier', 'Hidden Layers', list_param_neural_networks)
As we can see, training time increases with the size of the hidden layers (naturally, since each pass has to flow through more units). The peaks for the other metrics seem to occur around 150-200 total hidden units; beyond that, the values start to decrease even though the networks keep growing.
One possible explanation is that the networks with more layers overfit the training data and therefore perform worse on the test data. A network with fewer layers tends to generalize better, which may make more sense given the low complexity of our data.
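With early_stopping=True, MLPClassifier holds out 10% of the training data and records one validation accuracy per epoch in the `validation_scores_` attribute, which makes the onset of overfitting directly visible. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=777)

# early_stopping holds out a validation fraction and tracks its score per epoch
clf = MLPClassifier(hidden_layer_sizes=(50, 10), max_iter=500,
                    early_stopping=True, n_iter_no_change=10,
                    random_state=777).fit(X, y)

print(len(clf.validation_scores_), max(clf.validation_scores_))
```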
Let's now load the best model and analyze its confusion matrix.
clf = load('models/nonLinear/NN_best.joblib')
y_pred = clf.predict(X_test_final)
print(classification_report(y_test_final, y_pred,target_names=label2id))
precision recall f1-score support
BenignTraffic 0.73 0.75 0.74 10000
Mirai 1.00 0.96 0.98 10000
DDoS 0.93 0.91 0.92 10000
Spoofing 0.75 0.63 0.68 10000
Web 0.57 0.69 0.63 10000
Recon 0.73 0.50 0.59 10000
BruteForce 0.28 0.56 0.37 3064
DoS 0.90 0.89 0.90 10000
accuracy 0.75 73064
macro avg 0.74 0.74 0.73 73064
weighted avg 0.78 0.75 0.76 73064
ConfusionMatrixDisplay.from_predictions(y_test_final, y_pred, display_labels=label2id, cmap='Blues', normalize='true', xticks_rotation=45)
plt.show()
From our analysis, the decision tree performs the best of all the non-linear classifiers.
In this section of the project, we train ensemble classifiers and evaluate them using the same metrics described above.
Some of the following code repeats earlier cells; originally we worked in multiple notebooks and kept that structure so that single sections can be run independently if needed.
!pip3 install xgboost
Collecting xgboost
Obtaining dependency information for xgboost from https://files.pythonhosted.org/packages/bc/43/242432efc3f60052a4a534dc4926b21e236ab4ec8d4920c593da3f65c65d/xgboost-2.0.2-py3-none-win_amd64.whl.metadata
Downloading xgboost-2.0.2-py3-none-win_amd64.whl.metadata (2.0 kB)
Requirement already satisfied: numpy in c:\programdata\anaconda3\lib\site-packages (from xgboost) (1.24.3)
Requirement already satisfied: scipy in c:\programdata\anaconda3\lib\site-packages (from xgboost) (1.10.1)
Downloading xgboost-2.0.2-py3-none-win_amd64.whl (99.8 MB)
- -------------------------------------- 4.2/99.8 MB 2.4 MB/s eta 0:00:40
- -------------------------------------- 4.3/99.8 MB 2.5 MB/s eta 0:00:39
- -------------------------------------- 4.4/99.8 MB 2.5 MB/s eta 0:00:39
- -------------------------------------- 4.6/99.8 MB 2.5 MB/s eta 0:00:39
- -------------------------------------- 4.7/99.8 MB 2.5 MB/s eta 0:00:39
- -------------------------------------- 4.8/99.8 MB 2.5 MB/s eta 0:00:39
- -------------------------------------- 4.9/99.8 MB 2.5 MB/s eta 0:00:39
-- ------------------------------------- 5.0/99.8 MB 2.5 MB/s eta 0:00:39
-- ------------------------------------- 5.1/99.8 MB 2.5 MB/s eta 0:00:39
-- ------------------------------------- 5.2/99.8 MB 2.5 MB/s eta 0:00:39
-- ------------------------------------- 5.3/99.8 MB 2.4 MB/s eta 0:00:39
-- ------------------------------------- 5.4/99.8 MB 2.4 MB/s eta 0:00:39
-- ------------------------------------- 5.5/99.8 MB 2.4 MB/s eta 0:00:39
-- ------------------------------------- 5.7/99.8 MB 2.4 MB/s eta 0:00:39
-- ------------------------------------- 5.7/99.8 MB 2.4 MB/s eta 0:00:39
-- ------------------------------------- 5.7/99.8 MB 2.4 MB/s eta 0:00:40
-- ------------------------------------- 5.9/99.8 MB 2.4 MB/s eta 0:00:40
-- ------------------------------------- 6.0/99.8 MB 2.4 MB/s eta 0:00:40
-- ------------------------------------- 6.1/99.8 MB 2.4 MB/s eta 0:00:40
-- ------------------------------------- 6.2/99.8 MB 2.4 MB/s eta 0:00:39
-- ------------------------------------- 6.4/99.8 MB 2.4 MB/s eta 0:00:39
-- ------------------------------------- 6.5/99.8 MB 2.4 MB/s eta 0:00:39
-- ------------------------------------- 6.6/99.8 MB 2.4 MB/s eta 0:00:39
-- ------------------------------------- 6.8/99.8 MB 2.4 MB/s eta 0:00:39
-- ------------------------------------- 6.9/99.8 MB 2.4 MB/s eta 0:00:39
-- ------------------------------------- 7.0/99.8 MB 2.5 MB/s eta 0:00:38
-- ------------------------------------- 7.2/99.8 MB 2.5 MB/s eta 0:00:38
-- ------------------------------------- 7.3/99.8 MB 2.5 MB/s eta 0:00:38
-- ------------------------------------- 7.4/99.8 MB 2.5 MB/s eta 0:00:38
--- ------------------------------------ 7.6/99.8 MB 2.5 MB/s eta 0:00:38
--- ------------------------------------ 7.7/99.8 MB 2.5 MB/s eta 0:00:37
--- ------------------------------------ 7.8/99.8 MB 2.5 MB/s eta 0:00:37
--- ------------------------------------ 8.0/99.8 MB 2.5 MB/s eta 0:00:37
--- ------------------------------------ 8.1/99.8 MB 2.5 MB/s eta 0:00:37
--- ------------------------------------ 8.2/99.8 MB 2.5 MB/s eta 0:00:37
--- ------------------------------------ 8.4/99.8 MB 2.5 MB/s eta 0:00:37
--- ------------------------------------ 8.5/99.8 MB 2.5 MB/s eta 0:00:37
--- ------------------------------------ 8.6/99.8 MB 2.5 MB/s eta 0:00:37
--- ------------------------------------ 8.8/99.8 MB 2.5 MB/s eta 0:00:36
--- ------------------------------------ 8.9/99.8 MB 2.5 MB/s eta 0:00:36
--- ------------------------------------ 9.1/99.8 MB 2.5 MB/s eta 0:00:36
--- ------------------------------------ 9.2/99.8 MB 2.5 MB/s eta 0:00:36
--- ------------------------------------ 9.3/99.8 MB 2.5 MB/s eta 0:00:36
--- ------------------------------------ 9.5/99.8 MB 2.5 MB/s eta 0:00:36
--- ------------------------------------ 9.6/99.8 MB 2.6 MB/s eta 0:00:36
--- ------------------------------------ 9.7/99.8 MB 2.6 MB/s eta 0:00:36
--- ------------------------------------ 9.9/99.8 MB 2.6 MB/s eta 0:00:35
---- ----------------------------------- 10.0/99.8 MB 2.6 MB/s eta 0:00:35
---- ----------------------------------- 10.1/99.8 MB 2.6 MB/s eta 0:00:35
---- ----------------------------------- 10.3/99.8 MB 2.6 MB/s eta 0:00:35
---- ----------------------------------- 10.4/99.8 MB 2.6 MB/s eta 0:00:35
---- ----------------------------------- 10.6/99.8 MB 2.6 MB/s eta 0:00:34
---- ----------------------------------- 10.7/99.8 MB 2.6 MB/s eta 0:00:34
---- ----------------------------------- 10.9/99.8 MB 2.7 MB/s eta 0:00:34
---- ----------------------------------- 10.9/99.8 MB 2.7 MB/s eta 0:00:34
---- ----------------------------------- 10.9/99.8 MB 2.7 MB/s eta 0:00:34
---- ----------------------------------- 10.9/99.8 MB 2.6 MB/s eta 0:00:35
---- ----------------------------------- 11.0/99.8 MB 2.6 MB/s eta 0:00:35
---- ----------------------------------- 11.1/99.8 MB 2.6 MB/s eta 0:00:35
---- ----------------------------------- 11.3/99.8 MB 2.6 MB/s eta 0:00:35
---- ----------------------------------- 11.4/99.8 MB 2.6 MB/s eta 0:00:35
---- ----------------------------------- 11.5/99.8 MB 2.6 MB/s eta 0:00:35
---- ----------------------------------- 11.7/99.8 MB 2.6 MB/s eta 0:00:34
---- ----------------------------------- 11.8/99.8 MB 2.6 MB/s eta 0:00:34
---- ----------------------------------- 12.0/99.8 MB 2.6 MB/s eta 0:00:34
---- ----------------------------------- 12.0/99.8 MB 2.6 MB/s eta 0:00:34
---- ----------------------------------- 12.2/99.8 MB 2.6 MB/s eta 0:00:34
---- ----------------------------------- 12.3/99.8 MB 2.6 MB/s eta 0:00:34
---- ----------------------------------- 12.4/99.8 MB 2.6 MB/s eta 0:00:34
----- ---------------------------------- 12.6/99.8 MB 2.6 MB/s eta 0:00:34
----- ---------------------------------- 12.7/99.8 MB 2.6 MB/s eta 0:00:34
----- ---------------------------------- 12.8/99.8 MB 2.6 MB/s eta 0:00:34
----- ---------------------------------- 12.8/99.8 MB 2.6 MB/s eta 0:00:34
----- ---------------------------------- 12.8/99.8 MB 2.6 MB/s eta 0:00:34
----- ---------------------------------- 12.8/99.8 MB 2.5 MB/s eta 0:00:35
----- ---------------------------------- 12.9/99.8 MB 2.5 MB/s eta 0:00:35
----- ---------------------------------- 13.1/99.8 MB 2.5 MB/s eta 0:00:35
----- ---------------------------------- 13.1/99.8 MB 2.5 MB/s eta 0:00:35
----- ---------------------------------- 13.3/99.8 MB 2.5 MB/s eta 0:00:35
----- ---------------------------------- 13.4/99.8 MB 2.5 MB/s eta 0:00:35
----- ---------------------------------- 13.5/99.8 MB 2.5 MB/s eta 0:00:35
----- ---------------------------------- 13.7/99.8 MB 2.5 MB/s eta 0:00:35
----- ---------------------------------- 13.8/99.8 MB 2.5 MB/s eta 0:00:35
----- ---------------------------------- 13.9/99.8 MB 2.5 MB/s eta 0:00:35
----- ---------------------------------- 14.1/99.8 MB 2.5 MB/s eta 0:00:34
----- ---------------------------------- 14.2/99.8 MB 2.5 MB/s eta 0:00:34
----- ---------------------------------- 14.3/99.8 MB 2.5 MB/s eta 0:00:34
----- ---------------------------------- 14.4/99.8 MB 2.5 MB/s eta 0:00:34
----- ---------------------------------- 14.6/99.8 MB 2.5 MB/s eta 0:00:34
----- ---------------------------------- 14.7/99.8 MB 2.5 MB/s eta 0:00:34
----- ---------------------------------- 14.8/99.8 MB 2.5 MB/s eta 0:00:34
------ --------------------------------- 15.0/99.8 MB 2.6 MB/s eta 0:00:34
------ --------------------------------- 15.1/99.8 MB 2.5 MB/s eta 0:00:34
------ --------------------------------- 15.2/99.8 MB 2.5 MB/s eta 0:00:34
------ --------------------------------- 15.3/99.8 MB 2.5 MB/s eta 0:00:34
------ --------------------------------- 15.4/99.8 MB 2.6 MB/s eta 0:00:33
------ --------------------------------- 15.5/99.8 MB 2.6 MB/s eta 0:00:33
------ --------------------------------- 15.6/99.8 MB 2.6 MB/s eta 0:00:33
------ --------------------------------- 15.7/99.8 MB 2.5 MB/s eta 0:00:33
------ --------------------------------- 15.9/99.8 MB 2.6 MB/s eta 0:00:33
------ --------------------------------- 16.0/99.8 MB 2.6 MB/s eta 0:00:33
------ --------------------------------- 16.1/99.8 MB 2.6 MB/s eta 0:00:33
------ --------------------------------- 16.2/99.8 MB 2.6 MB/s eta 0:00:33
------ --------------------------------- 16.2/99.8 MB 2.6 MB/s eta 0:00:33
------ --------------------------------- 16.3/99.8 MB 2.5 MB/s eta 0:00:33
------ --------------------------------- 16.3/99.8 MB 2.5 MB/s eta 0:00:33
------ --------------------------------- 16.4/99.8 MB 2.5 MB/s eta 0:00:34
------ --------------------------------- 16.6/99.8 MB 2.5 MB/s eta 0:00:33
------ --------------------------------- 16.8/99.8 MB 2.5 MB/s eta 0:00:33
------ --------------------------------- 16.8/99.8 MB 2.5 MB/s eta 0:00:33
------ --------------------------------- 17.0/99.8 MB 2.5 MB/s eta 0:00:33
------ --------------------------------- 17.1/99.8 MB 2.5 MB/s eta 0:00:33
------ --------------------------------- 17.3/99.8 MB 2.5 MB/s eta 0:00:33
------ --------------------------------- 17.4/99.8 MB 2.5 MB/s eta 0:00:33
------- -------------------------------- 17.5/99.8 MB 2.5 MB/s eta 0:00:33
------- -------------------------------- 17.6/99.8 MB 2.5 MB/s eta 0:00:33
------- -------------------------------- 17.8/99.8 MB 2.5 MB/s eta 0:00:33
------- -------------------------------- 17.8/99.8 MB 2.5 MB/s eta 0:00:33
------- -------------------------------- 17.9/99.8 MB 2.5 MB/s eta 0:00:33
------- -------------------------------- 18.0/99.8 MB 2.5 MB/s eta 0:00:33
------- -------------------------------- 18.2/99.8 MB 2.5 MB/s eta 0:00:33
------- -------------------------------- 18.3/99.8 MB 2.5 MB/s eta 0:00:33
------- -------------------------------- 18.5/99.8 MB 2.5 MB/s eta 0:00:33
------- -------------------------------- 18.6/99.8 MB 2.5 MB/s eta 0:00:33
------- -------------------------------- 18.7/99.8 MB 2.5 MB/s eta 0:00:33
------- -------------------------------- 18.8/99.8 MB 2.5 MB/s eta 0:00:33
------- -------------------------------- 18.9/99.8 MB 2.5 MB/s eta 0:00:33
------- -------------------------------- 18.9/99.8 MB 2.5 MB/s eta 0:00:33
------- -------------------------------- 18.9/99.8 MB 2.5 MB/s eta 0:00:33
------- -------------------------------- 18.9/99.8 MB 2.4 MB/s eta 0:00:34
------- -------------------------------- 18.9/99.8 MB 2.4 MB/s eta 0:00:35
------- -------------------------------- 19.0/99.8 MB 2.4 MB/s eta 0:00:35
------- -------------------------------- 19.1/99.8 MB 2.4 MB/s eta 0:00:35
------- -------------------------------- 19.2/99.8 MB 2.4 MB/s eta 0:00:35
------- -------------------------------- 19.4/99.8 MB 2.4 MB/s eta 0:00:35
------- -------------------------------- 19.5/99.8 MB 2.3 MB/s eta 0:00:35
------- -------------------------------- 19.6/99.8 MB 2.4 MB/s eta 0:00:35
------- -------------------------------- 19.8/99.8 MB 2.3 MB/s eta 0:00:35
------- -------------------------------- 19.9/99.8 MB 2.3 MB/s eta 0:00:35
-------- ------------------------------- 20.1/99.8 MB 2.3 MB/s eta 0:00:35
-------- ------------------------------- 20.2/99.8 MB 2.3 MB/s eta 0:00:35
-------- ------------------------------- 20.3/99.8 MB 2.3 MB/s eta 0:00:35
-------- ------------------------------- 20.3/99.8 MB 2.3 MB/s eta 0:00:35
-------- ------------------------------- 20.4/99.8 MB 2.3 MB/s eta 0:00:35
-------- ------------------------------- 20.5/99.8 MB 2.3 MB/s eta 0:00:35
-------- ------------------------------- 20.7/99.8 MB 2.3 MB/s eta 0:00:35
-------- ------------------------------- 20.8/99.8 MB 2.3 MB/s eta 0:00:35
-------- ------------------------------- 20.8/99.8 MB 2.3 MB/s eta 0:00:35
-------- ------------------------------- 21.0/99.8 MB 2.3 MB/s eta 0:00:35
-------- ------------------------------- 21.1/99.8 MB 2.3 MB/s eta 0:00:34
-------- ------------------------------- 21.2/99.8 MB 2.3 MB/s eta 0:00:34
-------- ------------------------------- 21.4/99.8 MB 2.3 MB/s eta 0:00:34
-------- ------------------------------- 21.5/99.8 MB 2.3 MB/s eta 0:00:34
-------- ------------------------------- 21.6/99.8 MB 2.3 MB/s eta 0:00:34
-------- ------------------------------- 21.8/99.8 MB 2.3 MB/s eta 0:00:34
-------- ------------------------------- 21.9/99.8 MB 2.3 MB/s eta 0:00:34
-------- ------------------------------- 22.0/99.8 MB 2.3 MB/s eta 0:00:34
-------- ------------------------------- 22.1/99.8 MB 2.3 MB/s eta 0:00:34
-------- ------------------------------- 22.2/99.8 MB 2.3 MB/s eta 0:00:34
-------- ------------------------------- 22.3/99.8 MB 2.3 MB/s eta 0:00:34
-------- ------------------------------- 22.4/99.8 MB 2.3 MB/s eta 0:00:34
--------- ------------------------------ 22.5/99.8 MB 2.3 MB/s eta 0:00:34
--------- ------------------------------ 22.7/99.8 MB 2.3 MB/s eta 0:00:34
--------- ------------------------------ 22.8/99.8 MB 2.3 MB/s eta 0:00:34
--------- ------------------------------ 22.9/99.8 MB 2.3 MB/s eta 0:00:34
--------- ------------------------------ 23.0/99.8 MB 2.4 MB/s eta 0:00:33
--------- ------------------------------ 23.2/99.8 MB 2.4 MB/s eta 0:00:33
--------- ------------------------------ 23.3/99.8 MB 2.4 MB/s eta 0:00:33
--------- ------------------------------ 23.4/99.8 MB 2.4 MB/s eta 0:00:33
--------- ------------------------------ 23.5/99.8 MB 2.4 MB/s eta 0:00:32
--------- ------------------------------ 23.6/99.8 MB 2.4 MB/s eta 0:00:33
--------- ------------------------------ 23.7/99.8 MB 2.4 MB/s eta 0:00:33
--------- ------------------------------ 23.8/99.8 MB 2.4 MB/s eta 0:00:33
--------- ------------------------------ 23.9/99.8 MB 2.4 MB/s eta 0:00:33
--------- ------------------------------ 24.1/99.8 MB 2.3 MB/s eta 0:00:33
--------- ------------------------------ 24.2/99.8 MB 2.3 MB/s eta 0:00:33
--------- ------------------------------ 24.3/99.8 MB 2.4 MB/s eta 0:00:33
--------- ------------------------------ 24.4/99.8 MB 2.3 MB/s eta 0:00:33
--------- ------------------------------ 24.6/99.8 MB 2.3 MB/s eta 0:00:33
--------- ------------------------------ 24.7/99.8 MB 2.3 MB/s eta 0:00:33
--------- ------------------------------ 24.8/99.8 MB 2.3 MB/s eta 0:00:32
---------- ----------------------------- 25.0/99.8 MB 2.3 MB/s eta 0:00:32
---------- ----------------------------- 25.1/99.8 MB 2.4 MB/s eta 0:00:32
---------- ----------------------------- 25.2/99.8 MB 2.3 MB/s eta 0:00:32
---------- ----------------------------- 25.3/99.8 MB 2.4 MB/s eta 0:00:32
---------- ----------------------------- 25.4/99.8 MB 2.4 MB/s eta 0:00:32
---------- ----------------------------- 25.6/99.8 MB 2.4 MB/s eta 0:00:32
---------- ----------------------------- 25.6/99.8 MB 2.3 MB/s eta 0:00:32
---------- ----------------------------- 25.8/99.8 MB 2.3 MB/s eta 0:00:32
---------- ----------------------------- 25.9/99.8 MB 2.3 MB/s eta 0:00:32
---------- ----------------------------- 26.0/99.8 MB 2.4 MB/s eta 0:00:32
---------- ----------------------------- 26.1/99.8 MB 2.4 MB/s eta 0:00:32
---------- ----------------------------- 26.2/99.8 MB 2.3 MB/s eta 0:00:32
---------- ----------------------------- 26.3/99.8 MB 2.3 MB/s eta 0:00:32
---------- ----------------------------- 26.4/99.8 MB 2.3 MB/s eta 0:00:32
---------- ----------------------------- 26.5/99.8 MB 2.4 MB/s eta 0:00:31
---------- ----------------------------- 26.7/99.8 MB 2.4 MB/s eta 0:00:31
---------- ----------------------------- 26.8/99.8 MB 2.4 MB/s eta 0:00:31
---------- ----------------------------- 26.9/99.8 MB 2.4 MB/s eta 0:00:31
---------- ----------------------------- 27.0/99.8 MB 2.4 MB/s eta 0:00:31
---------- ----------------------------- 27.1/99.8 MB 2.4 MB/s eta 0:00:31
---------- ----------------------------- 27.3/99.8 MB 2.4 MB/s eta 0:00:31
---------- ----------------------------- 27.4/99.8 MB 2.4 MB/s eta 0:00:31
----------- ---------------------------- 27.5/99.8 MB 2.4 MB/s eta 0:00:31
----------- ---------------------------- 27.6/99.8 MB 2.4 MB/s eta 0:00:31
----------- ---------------------------- 27.8/99.8 MB 2.4 MB/s eta 0:00:31
----------- ---------------------------- 27.9/99.8 MB 2.4 MB/s eta 0:00:31
----------- ---------------------------- 27.9/99.8 MB 2.4 MB/s eta 0:00:31
----------- ---------------------------- 28.0/99.8 MB 2.3 MB/s eta 0:00:31
----------- ---------------------------- 28.1/99.8 MB 2.3 MB/s eta 0:00:31
----------- ---------------------------- 28.2/99.8 MB 2.3 MB/s eta 0:00:31
----------- ---------------------------- 28.3/99.8 MB 2.4 MB/s eta 0:00:31
----------- ---------------------------- 28.3/99.8 MB 2.3 MB/s eta 0:00:31
----------- ---------------------------- 28.5/99.8 MB 2.3 MB/s eta 0:00:31
----------- ---------------------------- 28.6/99.8 MB 2.3 MB/s eta 0:00:31
----------- ---------------------------- 28.7/99.8 MB 2.3 MB/s eta 0:00:31
----------- ---------------------------- 28.8/99.8 MB 2.3 MB/s eta 0:00:31
----------- ---------------------------- 28.9/99.8 MB 2.3 MB/s eta 0:00:31
----------- ---------------------------- 29.0/99.8 MB 2.3 MB/s eta 0:00:31
----------- ---------------------------- 29.1/99.8 MB 2.3 MB/s eta 0:00:31
----------- ---------------------------- 29.2/99.8 MB 2.4 MB/s eta 0:00:30
----------- ---------------------------- 29.3/99.8 MB 2.4 MB/s eta 0:00:30
----------- ---------------------------- 29.4/99.8 MB 2.4 MB/s eta 0:00:30
----------- ---------------------------- 29.5/99.8 MB 2.4 MB/s eta 0:00:30
----------- ---------------------------- 29.6/99.8 MB 2.4 MB/s eta 0:00:30
----------- ---------------------------- 29.7/99.8 MB 2.4 MB/s eta 0:00:30
----------- ---------------------------- 29.8/99.8 MB 2.4 MB/s eta 0:00:30
------------ --------------------------- 29.9/99.8 MB 2.4 MB/s eta 0:00:30
------------ --------------------------- 30.0/99.8 MB 2.4 MB/s eta 0:00:30
------------ --------------------------- 30.1/99.8 MB 2.4 MB/s eta 0:00:30
------------ --------------------------- 30.2/99.8 MB 2.4 MB/s eta 0:00:30
------------ --------------------------- 30.3/99.8 MB 2.4 MB/s eta 0:00:30
------------ --------------------------- 30.4/99.8 MB 2.4 MB/s eta 0:00:30
------------ --------------------------- 30.5/99.8 MB 2.4 MB/s eta 0:00:30
------------ --------------------------- 30.6/99.8 MB 2.4 MB/s eta 0:00:29
------------ --------------------------- 30.7/99.8 MB 2.4 MB/s eta 0:00:29
------------ --------------------------- 30.8/99.8 MB 2.4 MB/s eta 0:00:30
------------ --------------------------- 30.9/99.8 MB 2.4 MB/s eta 0:00:29
------------ --------------------------- 31.1/99.8 MB 2.4 MB/s eta 0:00:29
------------ --------------------------- 31.2/99.8 MB 2.4 MB/s eta 0:00:29
------------ --------------------------- 31.3/99.8 MB 2.4 MB/s eta 0:00:29
------------ --------------------------- 31.5/99.8 MB 2.4 MB/s eta 0:00:29
------------ --------------------------- 31.6/99.8 MB 2.4 MB/s eta 0:00:29
------------ --------------------------- 31.7/99.8 MB 2.4 MB/s eta 0:00:29
------------ --------------------------- 31.8/99.8 MB 2.4 MB/s eta 0:00:29
------------ --------------------------- 31.9/99.8 MB 2.4 MB/s eta 0:00:29
------------ --------------------------- 32.1/99.8 MB 2.4 MB/s eta 0:00:29
------------ --------------------------- 32.2/99.8 MB 2.4 MB/s eta 0:00:28
------------ --------------------------- 32.4/99.8 MB 2.4 MB/s eta 0:00:28
------------- -------------------------- 32.5/99.8 MB 2.4 MB/s eta 0:00:28
------------- -------------------------- 32.7/99.8 MB 2.5 MB/s eta 0:00:28
------------- -------------------------- 32.8/99.8 MB 2.4 MB/s eta 0:00:28
------------- -------------------------- 32.9/99.8 MB 2.4 MB/s eta 0:00:28
------------- -------------------------- 33.1/99.8 MB 2.4 MB/s eta 0:00:28
------------- -------------------------- 33.2/99.8 MB 2.4 MB/s eta 0:00:28
------------- -------------------------- 33.3/99.8 MB 2.4 MB/s eta 0:00:28
------------- -------------------------- 33.5/99.8 MB 2.5 MB/s eta 0:00:28
------------- -------------------------- 33.6/99.8 MB 2.5 MB/s eta 0:00:27
------------- -------------------------- 33.7/99.8 MB 2.5 MB/s eta 0:00:27
------------- -------------------------- 33.8/99.8 MB 2.5 MB/s eta 0:00:27
------------- -------------------------- 34.0/99.8 MB 2.5 MB/s eta 0:00:27
------------- -------------------------- 34.1/99.8 MB 2.5 MB/s eta 0:00:27
------------- -------------------------- 34.2/99.8 MB 2.5 MB/s eta 0:00:27
------------- -------------------------- 34.4/99.8 MB 2.5 MB/s eta 0:00:27
------------- -------------------------- 34.5/99.8 MB 2.5 MB/s eta 0:00:27
------------- -------------------------- 34.6/99.8 MB 2.5 MB/s eta 0:00:27
------------- -------------------------- 34.7/99.8 MB 2.5 MB/s eta 0:00:27
------------- -------------------------- 34.9/99.8 MB 2.5 MB/s eta 0:00:27
-------------- ------------------------- 34.9/99.8 MB 2.5 MB/s eta 0:00:27
-------------- ------------------------- 35.1/99.8 MB 2.5 MB/s eta 0:00:27
-------------- ------------------------- 35.2/99.8 MB 2.5 MB/s eta 0:00:27
-------------- ------------------------- 35.3/99.8 MB 2.5 MB/s eta 0:00:27
-------------- ------------------------- 35.5/99.8 MB 2.5 MB/s eta 0:00:26
-------------- ------------------------- 35.6/99.8 MB 2.5 MB/s eta 0:00:26
-------------- ------------------------- 35.7/99.8 MB 2.5 MB/s eta 0:00:26
-------------- ------------------------- 35.9/99.8 MB 2.5 MB/s eta 0:00:26
-------------- ------------------------- 35.9/99.8 MB 2.5 MB/s eta 0:00:26
-------------- ------------------------- 36.1/99.8 MB 2.5 MB/s eta 0:00:26
-------------- ------------------------- 36.2/99.8 MB 2.5 MB/s eta 0:00:26
-------------- ------------------------- 36.4/99.8 MB 2.5 MB/s eta 0:00:26
-------------- ------------------------- 36.5/99.8 MB 2.5 MB/s eta 0:00:26
-------------- ------------------------- 36.6/99.8 MB 2.5 MB/s eta 0:00:26
-------------- ------------------------- 36.7/99.8 MB 2.5 MB/s eta 0:00:26
-------------- ------------------------- 36.9/99.8 MB 2.5 MB/s eta 0:00:26
-------------- ------------------------- 37.0/99.8 MB 2.5 MB/s eta 0:00:25
-------------- ------------------------- 37.1/99.8 MB 2.5 MB/s eta 0:00:25
-------------- ------------------------- 37.3/99.8 MB 2.5 MB/s eta 0:00:25
-------------- ------------------------- 37.4/99.8 MB 2.5 MB/s eta 0:00:25
--------------- ------------------------ 37.5/99.8 MB 2.5 MB/s eta 0:00:25
--------------- ------------------------ 37.6/99.8 MB 2.5 MB/s eta 0:00:25
--------------- ------------------------ 37.8/99.8 MB 2.5 MB/s eta 0:00:25
--------------- ------------------------ 37.9/99.8 MB 2.5 MB/s eta 0:00:25
--------------- ------------------------ 38.0/99.8 MB 2.5 MB/s eta 0:00:25
--------------- ------------------------ 38.1/99.8 MB 2.6 MB/s eta 0:00:25
--------------- ------------------------ 38.1/99.8 MB 2.6 MB/s eta 0:00:25
--------------- ------------------------ 38.2/99.8 MB 2.5 MB/s eta 0:00:25
--------------- ------------------------ 38.3/99.8 MB 2.5 MB/s eta 0:00:25
--------------- ------------------------ 38.4/99.8 MB 2.5 MB/s eta 0:00:25
--------------- ------------------------ 38.5/99.8 MB 2.5 MB/s eta 0:00:25
--------------- ------------------------ 38.6/99.8 MB 2.5 MB/s eta 0:00:25
--------------- ------------------------ 38.7/99.8 MB 2.5 MB/s eta 0:00:25
--------------- ------------------------ 38.9/99.8 MB 2.5 MB/s eta 0:00:24
--------------- ------------------------ 39.0/99.8 MB 2.5 MB/s eta 0:00:24
--------------- ------------------------ 39.1/99.8 MB 2.6 MB/s eta 0:00:24
--------------- ------------------------ 39.3/99.8 MB 2.6 MB/s eta 0:00:24
--------------- ------------------------ 39.4/99.8 MB 2.6 MB/s eta 0:00:24
--------------- ------------------------ 39.5/99.8 MB 2.6 MB/s eta 0:00:24
--------------- ------------------------ 39.7/99.8 MB 2.6 MB/s eta 0:00:24
--------------- ------------------------ 39.8/99.8 MB 2.6 MB/s eta 0:00:24
--------------- ------------------------ 39.9/99.8 MB 2.6 MB/s eta 0:00:24
---------------- ----------------------- 40.0/99.8 MB 2.6 MB/s eta 0:00:23
---------------- ----------------------- 40.2/99.8 MB 2.6 MB/s eta 0:00:23
---------------- ----------------------- 40.3/99.8 MB 2.6 MB/s eta 0:00:23
---------------- ----------------------- 40.4/99.8 MB 2.7 MB/s eta 0:00:23
---------------- ----------------------- 40.6/99.8 MB 2.7 MB/s eta 0:00:23
---------------- ----------------------- 40.7/99.8 MB 2.7 MB/s eta 0:00:23
---------------- ----------------------- 40.8/99.8 MB 2.7 MB/s eta 0:00:23
---------------- ----------------------- 41.0/99.8 MB 2.7 MB/s eta 0:00:22
---------------- ----------------------- 41.1/99.8 MB 2.7 MB/s eta 0:00:22
---------------- ----------------------- 41.2/99.8 MB 2.7 MB/s eta 0:00:22
---------------- ----------------------- 41.3/99.8 MB 2.7 MB/s eta 0:00:22
---------------- ----------------------- 41.5/99.8 MB 2.7 MB/s eta 0:00:22
---------------- ----------------------- 41.6/99.8 MB 2.7 MB/s eta 0:00:22
---------------- ----------------------- 41.7/99.8 MB 2.7 MB/s eta 0:00:22
---------------- ----------------------- 41.9/99.8 MB 2.7 MB/s eta 0:00:22
---------------- ----------------------- 42.0/99.8 MB 2.7 MB/s eta 0:00:22
---------------- ----------------------- 42.1/99.8 MB 2.7 MB/s eta 0:00:22
---------------- ----------------------- 42.2/99.8 MB 2.7 MB/s eta 0:00:22
---------------- ----------------------- 42.4/99.8 MB 2.7 MB/s eta 0:00:22
----------------- ---------------------- 42.5/99.8 MB 2.7 MB/s eta 0:00:22
----------------- ---------------------- 42.7/99.8 MB 2.7 MB/s eta 0:00:22
----------------- ---------------------- 42.7/99.8 MB 2.7 MB/s eta 0:00:22
----------------- ---------------------- 42.8/99.8 MB 2.7 MB/s eta 0:00:22
----------------- ---------------------- 43.0/99.8 MB 2.7 MB/s eta 0:00:22
----------------- ---------------------- 43.1/99.8 MB 2.7 MB/s eta 0:00:22
----------------- ---------------------- 43.1/99.8 MB 2.6 MB/s eta 0:00:22
----------------- ---------------------- 43.3/99.8 MB 2.6 MB/s eta 0:00:22
----------------- ---------------------- 43.4/99.8 MB 2.6 MB/s eta 0:00:22
----------------- ---------------------- 43.5/99.8 MB 2.6 MB/s eta 0:00:22
----------------- ---------------------- 43.6/99.8 MB 2.6 MB/s eta 0:00:22
----------------- ---------------------- 43.8/99.8 MB 2.6 MB/s eta 0:00:22
----------------- ---------------------- 44.0/99.8 MB 2.6 MB/s eta 0:00:22
----------------- ---------------------- 44.0/99.8 MB 2.6 MB/s eta 0:00:22
----------------- ---------------------- 44.2/99.8 MB 2.6 MB/s eta 0:00:22
----------------- ---------------------- 44.3/99.8 MB 2.6 MB/s eta 0:00:22
----------------- ---------------------- 44.4/99.8 MB 2.6 MB/s eta 0:00:22
----------------- ---------------------- 44.5/99.8 MB 2.6 MB/s eta 0:00:22
----------------- ---------------------- 44.6/99.8 MB 2.6 MB/s eta 0:00:22
----------------- ---------------------- 44.7/99.8 MB 2.6 MB/s eta 0:00:22
----------------- ---------------------- 44.8/99.8 MB 2.6 MB/s eta 0:00:22
------------------ --------------------- 45.0/99.8 MB 2.6 MB/s eta 0:00:22
------------------ --------------------- 45.1/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 45.2/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 45.4/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 45.5/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 45.6/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 45.7/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 45.8/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 45.9/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 45.9/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 46.1/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 46.2/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 46.3/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 46.4/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 46.5/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 46.6/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 46.7/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 46.9/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 47.0/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 47.1/99.8 MB 2.6 MB/s eta 0:00:21
------------------ --------------------- 47.2/99.8 MB 2.5 MB/s eta 0:00:21
------------------ --------------------- 47.3/99.8 MB 2.5 MB/s eta 0:00:21
------------------- -------------------- 47.4/99.8 MB 2.5 MB/s eta 0:00:21
--------------------------------------- 99.3/99.8 MB 2.4 MB/s eta 0:00:01
--------------------------------------- 99.5/99.8 MB 2.5 MB/s eta 0:00:01
--------------------------------------- 99.6/99.8 MB 2.5 MB/s eta 0:00:01
--------------------------------------- 99.7/99.8 MB 2.5 MB/s eta 0:00:01
--------------------------------------- 99.7/99.8 MB 2.5 MB/s eta 0:00:01
--------------------------------------- 99.7/99.8 MB 2.5 MB/s eta 0:00:01
--------------------------------------- 99.7/99.8 MB 2.5 MB/s eta 0:00:01
--------------------------------------- 99.7/99.8 MB 2.5 MB/s eta 0:00:01
--------------------------------------- 99.7/99.8 MB 2.5 MB/s eta 0:00:01
--------------------------------------- 99.7/99.8 MB 2.5 MB/s eta 0:00:01
--------------------------------------- 99.7/99.8 MB 2.5 MB/s eta 0:00:01
--------------------------------------- 99.7/99.8 MB 2.5 MB/s eta 0:00:01
--------------------------------------- 99.7/99.8 MB 2.5 MB/s eta 0:00:01
---------------------------------------- 99.8/99.8 MB 2.2 MB/s eta 0:00:00
Installing collected packages: xgboost
Successfully installed xgboost-2.0.2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from joblib import dump, load
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier, plot_importance
from sklearn.neural_network import MLPClassifier
import time
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
f1_score,
confusion_matrix,
classification_report,
ConfusionMatrixDisplay,
precision_recall_curve,
average_precision_score,
roc_curve,
roc_auc_score
)
from sklearn.calibration import CalibrationDisplay
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
df_train = pd.read_csv("data/df_train.csv")
df_test = pd.read_csv("data/df_test.csv")
print(df_train.shape)
df_test.shape
(55749, 32)
(16000, 32)
df_train.columns
Index(['flow_duration', 'Header_Length', 'Protocol Type', 'Duration', 'Rate',
'fin_flag_number', 'syn_flag_number', 'rst_flag_number',
'psh_flag_number', 'ack_flag_number', 'ack_count', 'syn_count',
'fin_count', 'urg_count', 'rst_count', 'HTTP', 'HTTPS', 'SSH', 'TCP',
'UDP', 'ICMP', 'Tot sum', 'Min', 'Max', 'AVG', 'Std', 'Tot size', 'IAT',
'Covariance', 'Variance', 'label', 'category_label'],
dtype='object')
X_train = df_train.drop(columns=["label", "category_label"])
X_test = df_test.drop(columns=["label", "category_label"])
y_train = df_train["category_label"]
y_test = df_test["category_label"]
Now we scale the values in the train and test sets (every feature in this case, since all of them are numerical) so that models sensitive to feature magnitudes perform and predict better.
standard_scaler = StandardScaler().fit(X_train)
X_train = standard_scaler.transform(X_train)
X_test = standard_scaler.transform(X_test)
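To make explicit what the scaler does: StandardScaler learns each column's mean and standard deviation on the training set and maps values to z-scores. A minimal toy sketch (illustrative data, not from our dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: two features on very different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler().fit(X)  # learns per-column mean and std
X_scaled = scaler.transform(X)    # (x - mean) / std, column by column

print(scaler.mean_)               # [ 2. 20.]
print(X_scaled.mean(axis=0))      # ~[0. 0.]
print(X_scaled.std(axis=0))       # ~[1. 1.]
```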
Let's also save the standard scaler for later use.
dump(standard_scaler, "scalers/standard_scaler.bin", compress=True)
['scalers/standard_scaler.bin']
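The dumped scaler can later be restored with load and will transform new data identically. A small round-trip sketch (the temporary path here is just for illustration, not our actual scalers/ directory):

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(np.array([[0.0], [2.0]]))  # mean 1.0, std 1.0

# Dump to a temporary file and load it back (illustrative path)
path = os.path.join(tempfile.gettempdir(), "example_scaler.bin")
dump(scaler, path, compress=True)
restored = load(path)

print(restored.transform(np.array([[1.0]])))  # [[0.]] -- identical to the original scaler
```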
X_train
array([[-0.36801019, -0.35346166, 0.46054157, ..., -0.01018213,
-0.35051081, -1.27017796],
[-0.36888444, -0.38354661, -0.52506693, ..., -0.01058209,
-0.35051081, -1.27017796],
[-0.36888444, -0.38354661, -0.52506693, ..., -0.00643102,
-0.35051081, -1.27017796],
...,
[ 0.36874052, -0.37683062, -0.33690531, ..., 1.26147349,
-0.34579192, 0.99974844],
[-0.11088783, -0.32960377, -0.52506693, ..., 1.2614745 ,
1.57512056, 0.99974844],
[ 0.73144974, -0.33881354, -0.22938438, ..., -1.27568763,
-0.34866664, 0.7727558 ]])
X_test
array([[-0.3594684 , 3.34878602, -0.52506693, ..., 1.26151154,
3.37066307, 0.99974844],
[-0.33873717, -0.38126271, 0.46054157, ..., -1.27568763,
-0.3451429 , 0.7727558 ],
[-0.36784072, -0.38318251, -0.52506693, ..., -1.27568763,
-0.35007724, 0.7727558 ],
...,
[ 0.53284477, -0.34475238, -0.47130647, ..., 1.26163369,
-0.34559026, 0.99974844],
[ 0.09184591, -0.32815799, 0.26341987, ..., -1.27568763,
-0.34286041, 0.7727558 ],
[-0.36888444, -0.38354661, -0.52506693, ..., -0.00644665,
-0.35051081, -1.27017796]])
label2id = {'BenignTraffic': 0, 'Mirai': 1, 'DDoS': 2, 'Spoofing': 3, 'Web': 4, 'Recon': 5, 'BruteForce': 6, 'DoS': 7}
id2label = {0: 'BenignTraffic', 1: 'Mirai', 2: 'DDoS', 3: 'Spoofing', 4: 'Web', 5: 'Recon', 6: 'BruteForce', 7: 'DoS'}
y_train = [label2id[label] for label in y_train]
y_test = [label2id[label] for label in y_test]
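The two dictionaries are inverses of each other, so integer predictions can later be mapped back to readable class names. A quick round-trip check (same mapping as above):

```python
label2id = {'BenignTraffic': 0, 'Mirai': 1, 'DDoS': 2, 'Spoofing': 3,
            'Web': 4, 'Recon': 5, 'BruteForce': 6, 'DoS': 7}
id2label = {v: k for k, v in label2id.items()}  # inverse mapping

labels = ['DDoS', 'BenignTraffic', 'Mirai']
ids = [label2id[l] for l in labels]
print(ids)                          # [2, 0, 1]
print([id2label[i] for i in ids])   # back to the original class names
```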
df_test_final = pd.read_csv("data/final_test_data.csv")
df_test_final = df_test_final[df_train.columns]
X_test_final = df_test_final.drop(columns=["label", "category_label"])
y_test_final = df_test_final["category_label"]
X_test_final = standard_scaler.transform(X_test_final)
y_test_final = [label2id[label] for label in y_test_final]
X_test_final
array([[-0.36888444, -0.38354661, -0.52506693, ..., -0.00637693,
-0.35051081, -1.27017796],
[-0.36858791, -0.35326693, 0.46054157, ..., -0.01005686,
-0.35051081, -1.27017796],
[-0.3686571 , -0.36995928, 0.46054157, ..., -0.01019597,
-0.35049261, -1.15668164],
...,
[ 0.46692148, -0.34883984, -0.13082353, ..., 1.26109575,
-0.04195424, 0.99974844],
[-0.36870301, -0.3834317 , -0.417546 , ..., 1.26147301,
-0.34580619, 0.99974844],
[-0.10955779, -0.31405063, -0.48922662, ..., 1.26147385,
-0.3486514 , 0.99974844]])
df_test_final.category_label.value_counts().plot(kind='bar')
<Axes: >
def plot_metric(eval_dict, model_name, param_name, param_values, log_scale=False):
    fig, ax = plt.subplots(3, 2, figsize=(12, 8), sharex=True)
    metrics = {
        'Accuracy': 'list_accuracy',
        'Precision': 'list_precision',
        'Recall': 'list_recall',
        'Specificity': 'list_specificity',
        'F1': 'list_f1',
        'Training time': 'list_training_time'
    }
    colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:olive', 'tab:cyan', 'tab:red']
    for i, (metric_name, metric_key) in enumerate(metrics.items()):
        x_value = list(range(0, len(eval_dict[metric_key])))
        ax[i // 2, i % 2].plot(x_value, eval_dict[metric_key], c=colors[i])
        if log_scale:
            ax[i // 2, i % 2].set_xscale('log')
        # Mark the position of the maximum value
        max_idx = np.argmax(eval_dict[metric_key])
        max_value = eval_dict[metric_key][max_idx]
        ax[i // 2, i % 2].axvline(x=x_value[max_idx], color='lightgray', linestyle='--')
        ax[i // 2, i % 2].axhline(y=max_value, color='lightgray', linestyle='--')
        ax[i // 2, i % 2].scatter(x_value[max_idx], max_value, color='red', marker='o')
        ax[i // 2, i % 2].set_xticks(ticks=x_value, labels=param_values, rotation=90)
        ax[i // 2, i % 2].set_title(f'{metric_name} ({model_name})')
        ax[i // 2, i % 2].set_xlabel(param_name)
        ax[i // 2, i % 2].set_ylabel(metric_name)
        # Add a y-axis tick at the maximum value
        yticks = list(ax[i // 2, i % 2].get_yticks())
        del yticks[-2:]
        yticks.append(max_value)
        ax[i // 2, i % 2].set_yticks(yticks)
        if metric_name != 'Training time':
            ax[i // 2, i % 2].text(max_idx, eval_dict[metric_key][np.argmin(eval_dict[metric_key])], f'{param_values[max_idx]}', c='red')
    plt.tight_layout()
    plt.show()
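One panel above that deserves a definition is specificity: scikit-learn has no direct multiclass specificity scorer, but it can be derived from the confusion matrix, where for class i specificity is TN_i / (TN_i + FP_i), averaged here by class support. A minimal self-contained sketch on toy labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def weighted_specificity(y_true, y_pred):
    """Support-weighted specificity: average over classes of TN_i / (TN_i + FP_i)."""
    cm = confusion_matrix(y_true, y_pred)
    tp = np.diag(cm)
    actual = cm.sum(axis=1)      # support (row sums)
    predicted = cm.sum(axis=0)   # predicted counts (column sums)
    tn = cm.sum() - actual - predicted + tp
    fp = predicted - tp
    return float(np.average(tn / (tn + fp), weights=actual))

# Toy 3-class example
print(weighted_specificity([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0]))  # 0.8333...
```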
For XGBoost, the hyperparameter we tune is n_estimators, the number of boosted trees.
list_estimators = range(1, 200, 5)
eval_dict = {'list_accuracy': [],
'list_precision': [],
'list_recall': [],
'list_specificity': [],
'list_f1': [],
'list_training_time': []}
best_accuracy = 0.0
for val in list_estimators:
    # Create classifier instance
    clf = XGBClassifier(n_estimators=val, n_jobs=-1, random_state=777)
    # Register start time
    t_start = time.time()
    # Fit the classifier on the training data
    clf.fit(X_train, y_train)
    # Register end time
    t_stop = time.time()
    # Predict class labels on the test data
    y_pred = clf.predict(X_test)
    # Calculate accuracy
    current_accuracy = accuracy_score(y_test, y_pred)
    # Save the model if it has the best accuracy so far
    if current_accuracy > best_accuracy:
        best_accuracy = current_accuracy
        dump(clf, 'models/ensemble/xgb_best.joblib')
    eval_dict['list_accuracy'].append(current_accuracy)
    eval_dict['list_precision'].append(precision_score(y_test, y_pred, average='weighted'))
    eval_dict['list_recall'].append(recall_score(y_test, y_pred, average='weighted'))
    # Support-weighted specificity from the confusion matrix: per class i, TN_i / (TN_i + FP_i)
    cm = confusion_matrix(y_test, y_pred)
    specificity = (cm.sum() - cm.sum(axis=0) - cm.sum(axis=1) + np.diag(cm)) / (cm.sum() - cm.sum(axis=1))
    eval_dict['list_specificity'].append(np.average(specificity, weights=cm.sum(axis=1)))
    eval_dict['list_f1'].append(f1_score(y_test, y_pred, average='weighted'))
    eval_dict['list_training_time'].append(t_stop - t_start)
plot_metric(eval_dict, 'XGBoost', 'n_estimators', list_estimators)
Overall, the model reaches its best performance with 111 estimators (except for precision, which peaks at 101 estimators).
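The optimum quoted above can be read off programmatically from the sweep: the index of the best score maps back into the parameter grid. A sketch with made-up numbers (not our real measurements):

```python
import numpy as np

# Hypothetical sweep results, standing in for eval_dict['list_accuracy']
param_values = [1, 51, 101, 151]
accuracies = [0.70, 0.85, 0.90, 0.88]

best_idx = int(np.argmax(accuracies))
print(param_values[best_idx])  # 101 -- parameter value with the highest accuracy
```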
clf = load('models/ensemble/xgb_best.joblib')
y_pred = clf.predict(X_test_final)
print(classification_report(y_test_final, y_pred, target_names=label2id.keys()))
precision recall f1-score support
BenignTraffic 0.87 0.88 0.88 10000
Mirai 1.00 0.98 0.99 10000
DDoS 1.00 0.96 0.98 10000
Spoofing 0.87 0.83 0.85 10000
Web 0.78 0.87 0.82 10000
Recon 0.86 0.78 0.82 10000
BruteForce 0.60 0.81 0.69 3064
DoS 1.00 0.97 0.98 10000
accuracy 0.89 73064
macro avg 0.87 0.88 0.88 73064
weighted avg 0.90 0.89 0.89 73064
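The "weighted avg" row in reports like this one is each per-class score averaged with class support as the weights, which matters here because BruteForce has far fewer samples than the other classes. A small sketch verifying this against scikit-learn on toy labels:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

per_class = f1_score(y_true, y_pred, average=None)  # one F1 per class
support = np.bincount(y_true)                       # samples per true class
manual = float(np.average(per_class, weights=support))

print(np.isclose(manual, f1_score(y_true, y_pred, average='weighted')))  # True
```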
ConfusionMatrixDisplay.from_predictions(y_test_final, y_pred, display_labels=list(label2id), cmap='Blues', normalize='true', xticks_rotation=45)
plt.show()
Let's try some feature selection.
feature_names = df_train.drop(columns=["label", "category_label"]).columns
feature_importance = clf.feature_importances_
feature_importance_dict = dict(zip(feature_names, feature_importance))
sorted_features = sorted(feature_importance_dict.items(), key=lambda x: x[1])
feature_names, importance_values = zip(*sorted_features)
plt.figure(figsize=(10, 8))
plt.barh(range(len(feature_names)), importance_values, align='center')
plt.yticks(range(len(feature_names)), feature_names)
plt.xlabel('Feature Importance')
plt.title('Feature Importance Plot')
plt.show()
XGBoost assigns an importance score to every feature, which lets us select the most relevant ones. Note that the importance threshold is arbitrary: it is a trade-off between how many features we keep and how complex the analysis becomes. We decided to set the threshold at a feature importance of 0.05.
important_features = [feature for feature, importance in sorted_features if importance >= 0.05]
important_features
['Min', 'fin_count', 'fin_flag_number', 'AVG', 'IAT', 'UDP', 'SSH']
list_estimators = range(1, 200, 10)
eval_dict = {'list_accuracy': [],
'list_precision': [],
'list_recall': [],
'list_specificity': [],
'list_f1': [],
'list_training_time': []}
X_train_xgb_important = df_train[important_features]
X_test_xgb_important = df_test[important_features]
# Fit a fresh scaler on the feature subset so the saved full-feature scaler stays intact
subset_scaler = StandardScaler()
X_train_xgb_important = subset_scaler.fit_transform(X_train_xgb_important)
X_test_xgb_important = subset_scaler.transform(X_test_xgb_important)
best_accuracy = 0.0
for val in list_estimators:
    # Create classifier instance
    clf = XGBClassifier(n_estimators=val, n_jobs=-1, random_state=777)
    # Register start time
    t_start = time.time()
    # Fit the classifier on the training data
    clf.fit(X_train_xgb_important, y_train)
    # Register end time
    t_stop = time.time()
    # Predict class labels on the test data
    y_pred = clf.predict(X_test_xgb_important)
    # Calculate accuracy
    current_accuracy = accuracy_score(y_test, y_pred)
    # Save the model if it has the best accuracy so far
    if current_accuracy > best_accuracy:
        best_accuracy = current_accuracy
        dump(clf, 'models/ensemble/xgb_best_feature_importance_05.joblib')
    eval_dict['list_accuracy'].append(current_accuracy)
    eval_dict['list_precision'].append(precision_score(y_test, y_pred, average='weighted'))
    eval_dict['list_recall'].append(recall_score(y_test, y_pred, average='weighted'))
    # Support-weighted specificity from the confusion matrix: per class i, TN_i / (TN_i + FP_i)
    cm = confusion_matrix(y_test, y_pred)
    specificity = (cm.sum() - cm.sum(axis=0) - cm.sum(axis=1) + np.diag(cm)) / (cm.sum() - cm.sum(axis=1))
    eval_dict['list_specificity'].append(np.average(specificity, weights=cm.sum(axis=1)))
    eval_dict['list_f1'].append(f1_score(y_test, y_pred, average='weighted'))
    eval_dict['list_training_time'].append(t_stop - t_start)
plot_metric(eval_dict, 'XGBoost Selected', 'n_estimators', list_estimators)
Compared to before, the optimal number of estimators is 51 (as opposed to 111). Performance is much worse with this threshold, so let's try a smaller one.
important_features = [feature for feature, importance in sorted_features if importance >= 0.02]
important_features
['syn_flag_number', 'Max', 'rst_count', 'Variance', 'Min', 'fin_count', 'fin_flag_number', 'AVG', 'IAT', 'UDP', 'SSH']
list_estimators = range(1, 200, 10)
eval_dict = {'list_accuracy': [],
'list_precision': [],
'list_recall': [],
'list_specificity': [],
'list_f1': [],
'list_training_time': []}
X_train_xgb_important = df_train[important_features]
X_test_xgb_important = df_test[important_features]
# Fit a fresh scaler on the feature subset so the saved full-feature scaler stays intact
subset_scaler = StandardScaler()
X_train_xgb_important = subset_scaler.fit_transform(X_train_xgb_important)
X_test_xgb_important = subset_scaler.transform(X_test_xgb_important)
best_accuracy = 0.0
for val in list_estimators:
    # Create classifier instance
    clf = XGBClassifier(n_estimators=val, n_jobs=-1, random_state=777)
    # Register start time
    t_start = time.time()
    # Fit the classifier on the training data
    clf.fit(X_train_xgb_important, y_train)
    # Register end time
    t_stop = time.time()
    # Predict class labels on the test data
    y_pred = clf.predict(X_test_xgb_important)
    # Calculate accuracy
    current_accuracy = accuracy_score(y_test, y_pred)
    # Save the model if it has the best accuracy so far
    if current_accuracy > best_accuracy:
        best_accuracy = current_accuracy
        dump(clf, 'models/ensemble/xgb_best_feature_importance_02.joblib')
    eval_dict['list_accuracy'].append(current_accuracy)
    eval_dict['list_precision'].append(precision_score(y_test, y_pred, average='weighted'))
    eval_dict['list_recall'].append(recall_score(y_test, y_pred, average='weighted'))
    # Support-weighted specificity from the confusion matrix: per class i, TN_i / (TN_i + FP_i)
    cm = confusion_matrix(y_test, y_pred)
    specificity = (cm.sum() - cm.sum(axis=0) - cm.sum(axis=1) + np.diag(cm)) / (cm.sum() - cm.sum(axis=1))
    eval_dict['list_specificity'].append(np.average(specificity, weights=cm.sum(axis=1)))
    eval_dict['list_f1'].append(f1_score(y_test, y_pred, average='weighted'))
    eval_dict['list_training_time'].append(t_stop - t_start)
plot_metric(eval_dict, 'XGBoost Selected', 'n_estimators', list_estimators)
A little better, but still not nearly as good as the full feature set. Since training on the full set does not take much longer, we consider the full model the better choice.
For Random Forest, the hyperparameter under consideration is again the number of trees (n_estimators).
list_estimators = range(1,200,10)
eval_dict = {'list_accuracy': [],
'list_precision': [],
'list_recall': [],
'list_specificity': [],
'list_f1': [],
'list_training_time': []}
best_accuracy = 0.0
for val in list_estimators:
    # Create classifier instance
    clf = RandomForestClassifier(n_estimators=val, n_jobs=-1, random_state=777)
    # Register start time
    t_start = time.time()
    # Fit the classifier on the training data
    clf.fit(X_train, y_train)
    # Register end time
    t_stop = time.time()
    # Predict class labels on the test data
    y_pred = clf.predict(X_test)
    # Calculate accuracy
    current_accuracy = accuracy_score(y_test, y_pred)
    # Save the model if it has the best accuracy so far
    if current_accuracy > best_accuracy:
        best_accuracy = current_accuracy
        dump(clf, 'models/ensemble/RF_best.joblib')
    eval_dict['list_accuracy'].append(current_accuracy)
    eval_dict['list_precision'].append(precision_score(y_test, y_pred, average='weighted'))
    eval_dict['list_recall'].append(recall_score(y_test, y_pred, average='weighted'))
    # Support-weighted specificity from the confusion matrix: per class i, TN_i / (TN_i + FP_i)
    cm = confusion_matrix(y_test, y_pred)
    specificity = (cm.sum() - cm.sum(axis=0) - cm.sum(axis=1) + np.diag(cm)) / (cm.sum() - cm.sum(axis=1))
    eval_dict['list_specificity'].append(np.average(specificity, weights=cm.sum(axis=1)))
    eval_dict['list_f1'].append(f1_score(y_test, y_pred, average='weighted'))
    eval_dict['list_training_time'].append(t_stop - t_start)
plot_metric(eval_dict, 'Random Forest', 'n_estimators', list_estimators)
With Random Forest, we found that the optimal number of trees is 166, higher than for XGBoost.
clf = load('models/ensemble/RF_best.joblib')
y_pred = clf.predict(X_test_final)
print(classification_report(y_test_final, y_pred, target_names=list(label2id)))
precision recall f1-score support
BenignTraffic 0.86 0.86 0.86 10000
Mirai 0.99 0.99 0.99 10000
DDoS 1.00 0.96 0.98 10000
Spoofing 0.83 0.81 0.82 10000
Web 0.78 0.87 0.82 10000
Recon 0.88 0.75 0.81 10000
BruteForce 0.59 0.82 0.68 3064
DoS 1.00 0.97 0.98 10000
accuracy 0.88 73064
macro avg 0.87 0.88 0.87 73064
weighted avg 0.89 0.88 0.89 73064
ConfusionMatrixDisplay.from_predictions(y_test_final, y_pred, display_labels=list(label2id), cmap='Blues', normalize='true', xticks_rotation=45)
plt.show()
The overall performances are essentially the same, but we believe that XGBoost is the slightly better choice for our analysis (0.88 versus 0.86).
feature_names = df_train.drop(columns=["label", "category_label"]).columns
feature_importance = clf.feature_importances_
feature_importance_dict = dict(zip(feature_names, feature_importance))
sorted_features = sorted(feature_importance_dict.items(), key=lambda x: x[1])
feature_names, importance_values = zip(*sorted_features)
plt.figure(figsize=(10, 8))
plt.barh(range(len(feature_names)), importance_values, align='center')
plt.yticks(range(len(feature_names)), feature_names)
plt.xlabel('Feature Importance')
plt.title('Feature Importance Plot')
plt.show()
important_features = [feature for feature, importance in sorted_features if importance >= 0.05]
important_features
['flow_duration', 'rst_count', 'Tot size', 'AVG', 'Header_Length', 'IAT']
list_estimators = range(1, 200, 10)
eval_dict = {'list_accuracy': [],
'list_precision': [],
'list_recall': [],
'list_specificity': [],
'list_f1': [],
'list_training_time': []}
X_train_rf_important = df_train[important_features]
X_test_rf_important = df_test[important_features]
# Fit a fresh scaler on the feature subset so the saved full-feature scaler stays intact
subset_scaler = StandardScaler()
X_train_rf_important = subset_scaler.fit_transform(X_train_rf_important)
X_test_rf_important = subset_scaler.transform(X_test_rf_important)
best_accuracy = 0.0
for val in list_estimators:
    # Create classifier instance
    clf = RandomForestClassifier(n_estimators=val, n_jobs=-1, random_state=777)
    # Register start time
    t_start = time.time()
    # Fit the classifier on the training data
    clf.fit(X_train_rf_important, y_train)
    # Register end time
    t_stop = time.time()
    # Predict class labels on the test data
    y_pred = clf.predict(X_test_rf_important)
    # Calculate accuracy
    current_accuracy = accuracy_score(y_test, y_pred)
    # Save the model if it has the best accuracy so far
    if current_accuracy > best_accuracy:
        best_accuracy = current_accuracy
        dump(clf, 'models/ensemble/RF_best_feature_importance_05.joblib')
    eval_dict['list_accuracy'].append(current_accuracy)
    eval_dict['list_precision'].append(precision_score(y_test, y_pred, average='weighted'))
    eval_dict['list_recall'].append(recall_score(y_test, y_pred, average='weighted'))
    # Support-weighted specificity from the confusion matrix: per class i, TN_i / (TN_i + FP_i)
    cm = confusion_matrix(y_test, y_pred)
    specificity = (cm.sum() - cm.sum(axis=0) - cm.sum(axis=1) + np.diag(cm)) / (cm.sum() - cm.sum(axis=1))
    eval_dict['list_specificity'].append(np.average(specificity, weights=cm.sum(axis=1)))
    eval_dict['list_f1'].append(f1_score(y_test, y_pred, average='weighted'))
    eval_dict['list_training_time'].append(t_stop - t_start)
plot_metric(eval_dict, 'Random Forests Selected', 'n_estimators', list_estimators)
This feature subset performs much better than the XGBoost one did, and with more trees it might improve even further.
important_features = [feature for feature, importance in sorted_features if importance >= 0.02]
important_features
['Duration', 'Std', 'syn_count', 'Covariance', 'urg_count', 'Tot sum', 'Protocol Type', 'Rate', 'Variance', 'Min', 'Max', 'flow_duration', 'rst_count', 'Tot size', 'AVG', 'Header_Length', 'IAT']
list_estimators = range(1, 200, 10)
eval_dict = {'list_accuracy': [],
'list_precision': [],
'list_recall': [],
'list_specificity': [],
'list_f1': [],
'list_training_time': []}
X_train_rf_important = df_train[important_features]
X_test_rf_important = df_test[important_features]
X_train_rf_important = standard_scaler.fit_transform(X_train_rf_important)
X_test_rf_important = standard_scaler.transform(X_test_rf_important)
best_accuracy = 0.0
for val in list_estimators:
# Create classifier instance
clf = RandomForestClassifier(n_estimators=val, n_jobs=-1, random_state=777)
# Register start time
t_start = time.time()
# Fit the classifier on the training data
clf.fit(X_train_rf_important, y_train)
# Register end time
t_stop = time.time()
# Predict probabilities on the test data
y_pred = clf.predict(X_test_rf_important)
# Calculate accuracy
current_accuracy = accuracy_score(y_test, y_pred)
# Save the model if it has the best accuracy so far
if current_accuracy > best_accuracy:
best_accuracy = current_accuracy
dump(clf, 'models/ensemble/RF_best_feature_importance_02.joblib')
eval_dict['list_accuracy'].append(current_accuracy)
eval_dict['list_precision'].append(precision_score(y_test, y_pred, average='weighted'))
eval_dict['list_recall'].append(recall_score(y_test, y_pred, average='weighted'))
# Specificity proxy: recall of the benign class (pos_label is ignored when average != 'binary', hence the warnings)
eval_dict['list_specificity'].append(recall_score(y_test, y_pred, labels=[0], average='weighted'))
eval_dict['list_f1'].append(f1_score(y_test, y_pred, average='weighted'))
eval_dict['list_training_time'].append(t_stop - t_start)
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1521: UserWarning: Note that pos_label (set to 0) is ignored when average != 'binary' (got 'weighted'). You may use labels=[pos_label] to specify a single positive class. warnings.warn( (warning repeated for each value of n_estimators)
plot_metric(eval_dict, 'Random Forests Selected', 'n_estimators', list_estimators)
We can see that it actually performs very similarly to the model trained on all the features, because some features contribute very little.
Let's check whether the predictions produced by different classifiers actually differ, so we can decide whether a voting classifier is worthwhile.
We will consider the following models, each with its best parameters: Decision Trees, Random Forests, and XGBoost, since they were the ones that performed best.
y_dt = load('models/nonLinear/DT_best.joblib').predict(X_test_final)
y_rf = load('models/ensemble/RF_best.joblib').predict(X_test_final)
y_xgb = load('models/ensemble/xgb_best.joblib').predict(X_test_final)
sum([1 for i, j in zip(y_dt, y_rf) if i != j])
6499
sum([1 for i, j in zip(y_dt, y_xgb) if i != j])
6419
sum([1 for i, j in zip(y_rf, y_xgb) if i != j])
4360
sum([1 for i, j, k in zip(y_rf, y_xgb, y_dt) if i != j != k])  # chained comparison: i != j and j != k (i may still equal k)
2598
Some of the predictions clearly differ, so let's build a couple of voting classifiers and check for an improvement.
The best DT used max_depth=23, the best RF used 166 estimators, and the best XGBoost used 111 estimators.
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
clf = VotingClassifier(estimators=[
('DT', DecisionTreeClassifier(max_depth=23, random_state=777)),
('RF', RandomForestClassifier(n_estimators=166, random_state=777)),
('xgb', XGBClassifier(n_estimators=111, random_state=777))],
voting='hard', n_jobs=-1, verbose=1)
clf.fit(X_train, y_train)
[Voting] ....................... (1 of 3) Processing DT, total= 1.0s [Voting] ...................... (3 of 3) Processing xgb, total= 7.0s [Voting] ....................... (2 of 3) Processing RF, total= 23.9s
VotingClassifier(estimators=[('DT', DecisionTreeClassifier(max_depth=23, random_state=777)),
                             ('RF', RandomForestClassifier(n_estimators=166, random_state=777)),
                             ('xgb', XGBClassifier(n_estimators=111, random_state=777, ...))],
                 n_jobs=-1, verbose=1)
dump(clf, 'models/ensemble/voting_hard_best.joblib')
['models/ensemble/voting_hard_best.joblib']
print(classification_report(y_test, clf.predict(X_test), target_names=label2id))
precision recall f1-score support
BenignTraffic 0.85 0.90 0.87 2036
Mirai 0.99 0.99 0.99 2035
DDoS 1.00 0.97 0.98 1973
Spoofing 0.85 0.82 0.84 1960
Web 0.73 0.88 0.80 1974
Recon 0.88 0.77 0.82 2038
BruteForce 0.84 0.81 0.82 2051
DoS 1.00 0.98 0.99 1933
accuracy 0.89 16000
macro avg 0.89 0.89 0.89 16000
weighted avg 0.89 0.89 0.89 16000
ConfusionMatrixDisplay.from_predictions(y_test, clf.predict(X_test), display_labels=label2id, cmap='Blues', normalize='true', xticks_rotation=45)
plt.show()
clf = VotingClassifier(estimators=[
('DT', DecisionTreeClassifier(max_depth=23, random_state=777)),
('RF', RandomForestClassifier(n_estimators=166, random_state=777)),
('xgb', XGBClassifier(n_estimators=111, random_state=777))],
voting='soft', n_jobs=-1, verbose=1)
clf.fit(X_train, y_train)
[Voting] ....................... (1 of 3) Processing DT, total= 1.0s [Voting] ...................... (3 of 3) Processing xgb, total= 8.2s [Voting] ....................... (2 of 3) Processing RF, total= 24.1s
VotingClassifier(estimators=[('DT', DecisionTreeClassifier(max_depth=23, random_state=777)),
                             ('RF', RandomForestClassifier(n_estimators=166, random_state=777)),
                             ('xgb', XGBClassifier(n_estimators=111, random_state=777, ...))],
                 n_jobs=-1, verbose=1, voting='soft')
dump(clf, 'models/ensemble/voting_soft_best.joblib')
['models/ensemble/voting_soft_best.joblib']
print(classification_report(y_test, clf.predict(X_test), target_names=label2id))
precision recall f1-score support
BenignTraffic 0.86 0.89 0.87 2036
Mirai 1.00 0.98 0.99 2035
DDoS 1.00 0.97 0.98 1973
Spoofing 0.85 0.82 0.83 1960
Web 0.73 0.87 0.80 1974
Recon 0.85 0.78 0.81 2038
BruteForce 0.83 0.81 0.82 2051
DoS 1.00 0.98 0.99 1933
accuracy 0.89 16000
macro avg 0.89 0.89 0.89 16000
weighted avg 0.89 0.89 0.89 16000
ConfusionMatrixDisplay.from_predictions(y_test, clf.predict(X_test), display_labels=label2id, cmap='Blues', normalize='true', xticks_rotation=45)
plt.show()
The hard-voting classifier seems a little better at recognizing BenignTraffic (on the small test set), so we will use that one.
print(classification_report(y_test_final, clf.predict(X_test_final), target_names=label2id))
precision recall f1-score support
BenignTraffic 0.87 0.88 0.87 10000
Mirai 1.00 0.99 0.99 10000
DDoS 1.00 0.96 0.98 10000
Spoofing 0.87 0.82 0.85 10000
Web 0.79 0.88 0.83 10000
Recon 0.87 0.78 0.82 10000
BruteForce 0.60 0.81 0.69 3064
DoS 1.00 0.98 0.99 10000
accuracy 0.89 73064
macro avg 0.87 0.89 0.88 73064
weighted avg 0.90 0.89 0.90 73064
ConfusionMatrixDisplay.from_predictions(y_test_final, clf.predict(X_test_final), display_labels=label2id, cmap='Blues', normalize='true', xticks_rotation=45)
plt.show()
We get a slight increase in performance without requiring too much time, and we are happy that the predictions stay consistent on the larger test dataset.
From the sklearn documentation: An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
We will try this classifier after looking at this paper: https://www.mdpi.com/1424-8220/23/13/5941#B99-sensors-23-05941
The results reported in the paper for AdaBoost are not great, but we will try a few configurations to see how it behaves.
from sklearn.ensemble import AdaBoostClassifier
list_estimators = range(1,200,20)
eval_dict = {'list_accuracy': [],
'list_precision': [],
'list_recall': [],
'list_specificity': [],
'list_f1': [],
'list_training_time': []}
best_accuracy = 0.0
for val in list_estimators:
# Create classifier instance
clf = AdaBoostClassifier(n_estimators=val, random_state=777)
# Register start time
t_start = time.time()
# Fit the classifier on the training data
clf.fit(X_train, y_train)
# Register end time
t_stop = time.time()
# Predict probabilities on the test data
y_pred = clf.predict(X_test)
# Calculate accuracy
current_accuracy = accuracy_score(y_test, y_pred)
# Save the model if it has the best accuracy so far
if current_accuracy > best_accuracy:
best_accuracy = current_accuracy
dump(clf, 'models/ensemble/AdaBoost_best.joblib')
eval_dict['list_accuracy'].append(current_accuracy)
eval_dict['list_precision'].append(precision_score(y_test, y_pred, average='weighted'))
eval_dict['list_recall'].append(recall_score(y_test, y_pred, average='weighted'))
# Specificity proxy: recall of the benign class (pos_label is ignored when average != 'binary', hence the warnings)
eval_dict['list_specificity'].append(recall_score(y_test, y_pred, labels=[0], average='weighted'))
eval_dict['list_f1'].append(f1_score(y_test, y_pred, average='weighted'))
eval_dict['list_training_time'].append(t_stop - t_start)
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1469: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. _warn_prf(average, modifier, msg_start, len(result))
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1521: UserWarning: Note that pos_label (set to 0) is ignored when average != 'binary' (got 'weighted'). You may use labels=[pos_label] to specify a single positive class. warnings.warn( (warning repeated for each value of n_estimators)
plot_metric(eval_dict, 'AdaBoost', 'n_estimators', list_estimators)
We can see that it does in fact perform quite poorly. It was fun to try, though.
In this section of the project, we will train unsupervised classifiers.
We will use the following metrics to evaluate them:
Homogeneity score: A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.
Completeness score: A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.
V-measure score: the harmonic mean between homogeneity and completeness.
Distortion score: the sum of square distances from each point to its assigned center.
calinski_harabasz_score: The score is defined as ratio of the sum of between-cluster dispersion and of within-cluster dispersion.
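As a quick illustration of these metrics (a toy example, not the IoT data): here a clustering merges two of three true classes into one cluster, so completeness is perfect while homogeneity is not.

```python
import numpy as np
from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score, calinski_harabasz_score)

# Toy example: true labels vs. a clustering assignment
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 1, 1, 1]  # merges classes 1 and 2 into one cluster

h = homogeneity_score(labels_true, labels_pred)   # < 1: cluster 1 mixes two classes
c = completeness_score(labels_true, labels_pred)  # 1.0: each class sits entirely in one cluster
v = v_measure_score(labels_true, labels_pred)     # harmonic mean of h and c

# calinski_harabasz_score needs the data points themselves, not just the labels
X_demo = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 10], [10, 11]])
ch = calinski_harabasz_score(X_demo, labels_true)
```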
Some of the following code is a repetition from before, this is because originally we were using multiple notebooks and preferred to keep the structure to run single sections if needed.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from joblib import dump, load
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import (
homogeneity_score,
completeness_score,
v_measure_score,
silhouette_score,
calinski_harabasz_score,
)
import time
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
f1_score,
confusion_matrix,
classification_report,
ConfusionMatrixDisplay,
precision_recall_curve,
average_precision_score,
roc_curve,
roc_auc_score
)
from sklearn.calibration import CalibrationDisplay
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
Because we are working with unlabeled data, it doesn't make sense to reuse the processed data from before: that preprocessing looked at the distributions by label. We will instead use the full dataset we were able to obtain from the Kaggle repository.
df = pd.read_csv("data/full_balanced_data.csv")
df.shape
(153064, 48)
df.columns
Index(['flow_duration', 'Header_Length', 'Protocol Type', 'Duration', 'Rate',
'Srate', 'Drate', 'fin_flag_number', 'syn_flag_number',
'rst_flag_number', 'psh_flag_number', 'ack_flag_number',
'ece_flag_number', 'cwr_flag_number', 'ack_count', 'syn_count',
'fin_count', 'urg_count', 'rst_count', 'HTTP', 'HTTPS', 'DNS', 'Telnet',
'SMTP', 'SSH', 'IRC', 'TCP', 'UDP', 'DHCP', 'ARP', 'ICMP', 'IPv', 'LLC',
'Tot sum', 'Min', 'Max', 'AVG', 'Std', 'Tot size', 'IAT', 'Number',
'Magnitue', 'Radius', 'Covariance', 'Variance', 'Weight', 'label',
'category_label'],
dtype='object')
We will just remove the columns whose values are all identical.
def is_unique(s):
# True if every value in the Series equals the first one
a = s.to_numpy()
return (a[0] == a).all()
col = []
for i in df.columns:
# Keep only columns that are not constant
if not is_unique(df[i]):
col.append(i)
df = df[col]
df.shape
(153064, 44)
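As an aside, the constant-column filter above can be written as a pandas one-liner using `nunique` (a sketch on a hypothetical frame; equivalent to the loop when there are no NaNs):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "a": [1, 2, 3],       # varies -> keep
    "b": [7, 7, 7],       # constant -> drop
    "c": ["x", "y", "x"]  # varies -> keep
})

# Keep only columns with more than one distinct value
df_filtered = df_demo.loc[:, df_demo.nunique() > 1]
```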
X = df.drop(columns=["label", "category_label"])
y = df["category_label"]
Now we scale the feature values (the whole dataset in this case, since there is no train/test split here and all features are numerical) to get better performance and predictions.
standard_scaler = StandardScaler().fit(X)
X = standard_scaler.transform(X)
Let's also save the standard scaler for future use.
#dump(standard_scaler, "scalers/standard_scaler.bin", compress=True)
X
array([[-0.09225225, -0.38725648, -0.52567095, ..., -0.22467681,
-1.2470541 , 0.00223954],
[-0.09433133, -0.38733216, -0.96823971, ..., -0.22467681,
-1.2470541 , 0.00223954],
[-0.09433133, -0.38728425, -0.52567095, ..., -0.22466911,
-1.17866775, 0.00223954],
...,
[-0.09406873, 0.04294229, -0.52567095, ..., 0.4295219 ,
0.8045363 , -1.2954638 ],
[-0.09426278, -0.38685539, -0.52567095, ..., -0.22456615,
-0.10728166, -1.2954638 ],
[-0.01101208, -0.37959922, 0.15588494, ..., -0.22321886,
1.03249079, 1.29994288]])
Here we assigned numerical values to each label
label2id = {'BenignTraffic': 0, 'Mirai': 1, 'DDoS': 2, 'Spoofing': 3, 'Web': 4, 'Recon': 5, 'BruteForce': 6, 'DoS': 7}
id2label = {0: 'BenignTraffic', 1: 'Mirai', 2: 'DDoS', 3: 'Spoofing', 4: 'Web', 5: 'Recon', 6: 'BruteForce', 7: 'DoS'}
y = [label2id[label] for label in y]
These helper functions let us plot KMeans results when the data is reduced to 2 dimensions with PCA.
def plot_data_C(ax, X, col):
ax.scatter(X[:, 0], X[:, 1], c=col, marker='.', s=2)
def plot_centroids_C(ax, centroids, weights=None, circle_color='w', cross_color='k'):
if weights is not None:
centroids = centroids[weights > weights.max() / 10]
ax.scatter(centroids[:, 0], centroids[:, 1],
marker='o', s=35, linewidths=8,
color=circle_color, zorder=10, alpha=0.9)
ax.scatter(centroids[:, 0], centroids[:, 1],
marker='x', s=2, linewidths=12,
color=cross_color, zorder=11, alpha=1)
def plot_decision_boundaries(clusterer, X, col, resolution=1000, show_centroids=True,
show_xlabels=True, show_ylabels=True, ax=None):
mins = X.min(axis=0) - 0.1
maxs = X.max(axis=0) + 0.1
xx, yy = np.meshgrid(np.linspace(mins[0], maxs[0], resolution),
np.linspace(mins[1], maxs[1], resolution))
Z = clusterer.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
ax.contourf(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),
cmap="Pastel2")
ax.contour(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),
linewidths=1, colors='k')
plot_data_C(ax, X, col)
if show_centroids:
plot_centroids_C(ax, clusterer.cluster_centers_)
if show_xlabels:
ax.set_xlabel("$x_1$", fontsize=14)
else:
ax.tick_params(labelbottom=False)
if show_ylabels:
ax.set_ylabel("$x_2$", fontsize=14, rotation=0)
else:
ax.tick_params(labelleft=False)
return ax
pca = PCA(n_components = 2)
X_PCA_2 = pca.fit_transform(X)
print(X_PCA_2.shape)
(153064, 2)
col = []
for c in y:
if c == 0:
col.append("b")
elif c == 1:
col.append("g")
elif c == 2:
col.append("r")
elif c == 3:
col.append("c")
elif c == 4:
col.append("m")
elif c == 5:
col.append("y")
elif c == 6:
col.append("#000000")
else:
col.append("#FFC0CB")
print(len(col))
153064
from sklearn import metrics
from scipy import stats
k = 8
for i in [42, 777, 1234]:
kmeans1 = KMeans(n_clusters=k, random_state=i, n_init=10, max_iter=1)
kmeans10 = KMeans(n_clusters=k, random_state=i, n_init=10, max_iter=10)
kmeans300 = KMeans(n_clusters=k, random_state=i, n_init=10, max_iter=300)
kmeans1.fit(X_PCA_2)
kmeans10.fit(X_PCA_2)
kmeans300.fit(X_PCA_2)
y_pred_1 = kmeans1.predict(X_PCA_2)
y_pred_10 = kmeans10.predict(X_PCA_2)
y_pred_300 = kmeans300.predict(X_PCA_2)
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(20, 5))
axes[0].clear()
plot_decision_boundaries(kmeans1, X_PCA_2, col, ax=axes[0])
axes[0].set_title(f"kmeans for random state {i}, max_iter=1")
axes[0].annotate(f"""
homogeneity_score: {homogeneity_score(y, y_pred_1)}
completeness_score: {completeness_score(y, y_pred_1)}
v_measure_score: {v_measure_score(y, y_pred_1)}""",
xy=(1, 1), xytext=(-12, -12), va='top', ha='right',
xycoords='axes fraction', textcoords='offset points')
axes[1].clear()
plot_decision_boundaries(kmeans10, X_PCA_2, col, ax=axes[1])
axes[1].set_title(f"kmeans for random state {i}, max_iter=10")
axes[1].annotate(f"""
homogeneity_score: {homogeneity_score(y, y_pred_10)}
completeness_score: {completeness_score(y, y_pred_10)}
v_measure_score: {v_measure_score(y, y_pred_10)}""",
xy=(1, 1), xytext=(-12, -12), va='top', ha='right',
xycoords='axes fraction', textcoords='offset points')
axes[2].clear()
plot_decision_boundaries(kmeans300, X_PCA_2, col, ax=axes[2])
axes[2].set_title(f"kmeans for random state {i}, max_iter=300")
axes[2].annotate(f"""
homogeneity_score: {homogeneity_score(y, y_pred_300)}
completeness_score: {completeness_score(y, y_pred_300)}
v_measure_score: {v_measure_score(y, y_pred_300)}""",
xy=(1, 1), xytext=(-12, -12), va='top', ha='right',
xycoords='axes fraction', textcoords='offset points')
plt.tight_layout()
plt.show()
In the graphs above, the colored points are our labels, the x's are the centroids, and the lines are the decision boundaries for each cluster.
pca = PCA(n_components=0.95)
low = pca.fit_transform(X)
low.shape
(153064, 24)
k = 8
homogeneity_scores = []
completeness_scores = []
v_measure_scores = []
for i in [42, 777, 1234]:
kmeans300 = KMeans(n_clusters=k, random_state=i, n_init=10, max_iter=300)
kmeans300.fit(low)
y_pred_300 = kmeans300.predict(low)
homogeneity_scores.append(homogeneity_score(y, y_pred_300))
completeness_scores.append(completeness_score(y, y_pred_300))
v_measure_scores.append(v_measure_score(y, y_pred_300))
print(f"homogeneity_scores: {homogeneity_scores}\t mean: {np.mean(homogeneity_scores)}")
print(f"completeness_scores: {completeness_scores}\t mean: {np.mean(completeness_scores)}")
print(f"v_measure_scores: {v_measure_scores}\t\t mean: {np.mean(v_measure_scores)}")
homogeneity_scores: [0.30056590362428914, 0.30692794252012123, 0.30265663118413055] mean: 0.30338349244284696
completeness_scores: [0.36823037734485575, 0.36674136401404195, 0.3696466973289305] mean: 0.36820614622927605
v_measure_scores: [0.3309752140014537, 0.33417931083414854, 0.3328141313515469] mean: 0.3326562187290497
We can see that random state 1234, using PCA with 24 components (95% of the variance), works better than the others.
pca.explained_variance_ratio_
array([0.19141727, 0.08275136, 0.07633586, 0.0611547 , 0.05323904,
0.05001338, 0.04734656, 0.04511688, 0.03811467, 0.03409679,
0.02761397, 0.02642667, 0.02431798, 0.02380349, 0.02372975,
0.02314458, 0.02225953, 0.02114544, 0.01793775, 0.01521524,
0.01429797, 0.01227503, 0.01047544, 0.00964015])
cumsum = np.cumsum(pca.explained_variance_ratio_)
cumsum = np.append(0, cumsum)
d = np.argmax(cumsum >= 0.95)
d
24
plt.figure(figsize=(10,6))
plt.grid(True, alpha=0.4, c="black")
plt.plot(cumsum, linewidth=3, c="teal")
plt.axis([0, 30, 0, 1])
plt.xlabel("Dimensions")
plt.ylabel("Explained Variance")
plt.yticks(np.linspace(0,1,21))
plt.xticks(np.linspace(0,30,31))
plt.title("Explained Variance of PCA")
plt.plot([d, d], [0, cumsum[-1]], "k:")
plt.plot([0, d], [cumsum[-1], cumsum[-1]], "k:")
plt.plot(d, cumsum[-1], "ko")
plt.axhline(y=0.85, c="r", linestyle="--")
plt.annotate("Elbow\n(>= 85% variance)", xy=(np.argmax(cumsum >= 0.85), 0.85),
xytext=(np.argmax(cumsum >= 0.85)+8, 0.5),
arrowprops=dict(arrowstyle="->", color = "black"), fontsize=16)
# save_fig("explained_variance_plot")
plt.show()
According to this graph, the "elbow" (where the cumulative explained variance of the PCA reaches 85%) is at 17 dimensions. Additionally, the 95% explained variance that we selected is reached at 24 dimensions.
k = 8
homogeneity_scores = []
completeness_scores = []
v_measure_scores = []
for i in [42, 777, 1234]:
    kmeans300 = KMeans(n_clusters=k, random_state=i, n_init=10, max_iter=300)
    kmeans300.fit(X)
    y_pred_300 = kmeans300.predict(X)
    homogeneity_scores.append(homogeneity_score(y, y_pred_300))
    completeness_scores.append(completeness_score(y, y_pred_300))
    v_measure_scores.append(v_measure_score(y, y_pred_300))
print(f"homogeneity_scores: {homogeneity_scores}\t mean: {np.mean(homogeneity_scores)}")
print(f"completeness_scores: {completeness_scores}\t mean: {np.mean(completeness_scores)}")
print(f"v_measure_scores: {v_measure_scores}\t\t mean: {np.mean(v_measure_scores)}")
homogeneity_scores: [0.2669532676877738, 0.2245389988080147, 0.3562568737666425]   mean: 0.28258304675414364
completeness_scores: [0.3323773756505023, 0.3155165607007077, 0.4131955445354732]   mean: 0.3536964936288944
v_measure_scores: [0.296094409727378, 0.2623647563651861, 0.3826195082350512]       mean: 0.3136928914425384
PCA works reasonably well: we need 24 dimensions to keep 95% of the variance. A two-dimensional PCA, on the other hand, is not very helpful, since it retains less than 30% of the variance.
We chose k=8 because there are 8 labels.
Both t-SNE and UMAP take extremely long on the full dataset, so we were not able to run them. Ideally, we would have carried out the same analysis we did for PCA.
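One workaround we did not pursue in the notebook: t-SNE scales poorly with sample count, so one can embed a stratified subsample instead of the full dataset. A sketch on synthetic data (`X_demo`, `y_demo` are stand-ins for our `X` and `y`):

```python
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and labels.
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, n_classes=4,
                                     n_informative=6, random_state=42)

# Stratified subsample keeps the class proportions while shrinking the workload.
X_sub, _, y_sub, _ = train_test_split(X_demo, y_demo, train_size=500,
                                      stratify=y_demo, random_state=42)
X_tsne = TSNE(n_components=2, random_state=42, init="pca").fit_transform(X_sub)
print(X_tsne.shape)  # one 2-D point per subsampled row
```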
from sklearn.manifold import TSNE
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)
with open('data/reduced/test.npy', 'wb') as f:
    np.save(f, X_tsne)
!pip3 install umap-learn
import umap.umap_ as umap
embedding = umap.UMAP(n_jobs=-1).fit_transform(X)
with open('data/reduced/umap.npy', 'wb') as f:
    np.save(f, embedding)
k-means++ uses the data distribution to spread out the initial centroids, instead of placing them at random.
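Note that `KMeans` already uses k-means++ seeding by default (`init="k-means++"`); calling `kmeans_plusplus` separately is only needed when you want the seeds themselves. A sketch on synthetic blobs (the blob data is an assumption, not our dataset):

```python
from sklearn.cluster import KMeans, kmeans_plusplus
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=600, centers=8, random_state=42)

# Compute the k-means++ seeds explicitly...
centers, indices = kmeans_plusplus(X_demo, n_clusters=8, random_state=42)

# ...and reuse them as the initialisation for a fit on the full data
# (n_init=1 since the initialisation is deterministic here).
km = KMeans(n_clusters=8, init=centers, n_init=1, random_state=42).fit(X_demo)
print(centers.shape, km.cluster_centers_.shape)
```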
from sklearn.cluster import kmeans_plusplus
homogeneity_scores = []
completeness_scores = []
v_measure_scores = []
k = 8
accuracy_scores = []
for i in [42, 777, 1234]:
    # The k-means++ seeds must be passed as the initialisation for a fit on X;
    # fitting on the centroids themselves would cluster only the 8 seed points.
    centroids, indices = kmeans_plusplus(X, n_clusters=k, random_state=i)
    kmeans = KMeans(n_clusters=k, init=centroids, n_init=1, random_state=i)
    kmeans.fit(X)
    y_pred = kmeans.predict(X)
    homogeneity_scores.append(homogeneity_score(y, y_pred))
    completeness_scores.append(completeness_score(y, y_pred))
    v_measure_scores.append(v_measure_score(y, y_pred))
print(f"homogeneity_scores: {homogeneity_scores}\t mean: {np.mean(homogeneity_scores)}")
print(f"completeness_scores: {completeness_scores}\t mean: {np.mean(completeness_scores)}")
print(f"v_measure_scores: {v_measure_scores}\t mean: {np.mean(v_measure_scores)}")
homogeneity_scores: [0.23574583121575773, 0.23027245094576904, 0.2272843275507588]   mean: 0.23110086990409517
completeness_scores: [0.29582280918189474, 0.40363199123803906, 0.3068577598686748]   mean: 0.3354375200962029
v_measure_scores: [0.2623894215843747, 0.29324712596209834, 0.26114384635905963]   mean: 0.27226013130184423
k-means++ initialisation does not improve the results, as the low scores above show.
By looking at the distortion score according to the number of clusters, we can determine the best number of clusters by finding the elbow within the graph.
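The same elbow search can be done without yellowbrick by plotting k-means inertia (the sum of squared distances from each point to its closest centroid) against k. A sketch on synthetic blobs rather than our data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=500, centers=5, random_state=42)

# Inertia for each candidate k; it always decreases as k grows,
# and the "elbow" is where the decrease flattens out.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo).inertia_
            for k in range(2, 10)]
print([round(i) for i in inertias])
```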
!pip3 install yellowbrick
from yellowbrick.cluster import KElbowVisualizer
model = KMeans(random_state=42, n_init=10, max_iter=300)
visualizer = KElbowVisualizer(model, k=(2,34))
visualizer.fit(X)
visualizer.show()
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
According to the visualizer, the optimal number of clusters is k=19, but the "elbow shape" is not very well defined.
n_clusters = list(range(2, 34))
homogeneity_scores = []
completeness_scores = []
v_measure_scores = []
ch_scores = []
for k in n_clusters:
    t_start = time.time()
    kmeans = KMeans(n_clusters=k, n_init='auto', random_state=42).fit(X)
    y_pred = kmeans.predict(X)
    homogeneity_scores.append(homogeneity_score(y, y_pred))
    completeness_scores.append(completeness_score(y, y_pred))
    v_measure_scores.append(v_measure_score(y, y_pred))
    ch_scores.append(calinski_harabasz_score(X, y_pred))
    t_stop = time.time()
    print(f"Elapsed time: {t_stop - t_start:.5f} seconds (k: {k:2d})")
Elapsed time: 0.56051 seconds (k:  2)
Elapsed time: 0.52666 seconds (k:  3)
Elapsed time: 0.65544 seconds (k:  4)
Elapsed time: 0.65302 seconds (k:  5)
Elapsed time: 0.50405 seconds (k:  6)
Elapsed time: 0.50621 seconds (k:  7)
Elapsed time: 0.68009 seconds (k:  8)
Elapsed time: 0.62339 seconds (k:  9)
Elapsed time: 0.53888 seconds (k: 10)
Elapsed time: 0.54268 seconds (k: 11)
Elapsed time: 0.53101 seconds (k: 12)
Elapsed time: 0.55879 seconds (k: 13)
Elapsed time: 0.57854 seconds (k: 14)
Elapsed time: 0.76132 seconds (k: 15)
Elapsed time: 0.74158 seconds (k: 16)
Elapsed time: 0.75521 seconds (k: 17)
Elapsed time: 0.80725 seconds (k: 18)
Elapsed time: 0.79295 seconds (k: 19)
Elapsed time: 0.80438 seconds (k: 20)
Elapsed time: 0.84930 seconds (k: 21)
Elapsed time: 0.76353 seconds (k: 22)
Elapsed time: 0.85016 seconds (k: 23)
Elapsed time: 0.86430 seconds (k: 24)
Elapsed time: 1.38692 seconds (k: 25)
Elapsed time: 1.81350 seconds (k: 26)
Elapsed time: 1.56237 seconds (k: 27)
Elapsed time: 1.54686 seconds (k: 28)
Elapsed time: 1.54615 seconds (k: 29)
Elapsed time: 1.51521 seconds (k: 30)
Elapsed time: 1.78548 seconds (k: 31)
Elapsed time: 1.47663 seconds (k: 32)
Elapsed time: 1.55394 seconds (k: 33)
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(20, 5))
ax[0, 0].plot(n_clusters, homogeneity_scores)
ax[0, 0].set_title('KMeans homogeneity for varying numbers of clusters')
ax[0, 0].set_xlabel('n_clusters')
ax[0, 0].set_ylabel('Homogeneity score')
ax[0, 0].set_xticks(n_clusters)
ax[0, 0].grid(visible=True)
ax[0, 1].plot(n_clusters, completeness_scores)
ax[0, 1].set_title('KMeans completeness for varying numbers of clusters')
ax[0, 1].set_xlabel('n_clusters')
ax[0, 1].set_ylabel('Completeness score')
ax[0, 1].set_xticks(n_clusters)
ax[0, 1].grid(visible=True)
ax[1, 0].plot(n_clusters, v_measure_scores)
ax[1, 0].set_title('KMeans v_measure for varying numbers of clusters')
ax[1, 0].set_xlabel('n_clusters')
ax[1, 0].set_ylabel('V_measure score')
ax[1, 0].set_xticks(n_clusters)
ax[1, 0].grid(visible=True)
ax[1, 1].plot(n_clusters, ch_scores)
ax[1, 1].set_title('KMeans Calinski-Harabasz score for varying numbers of clusters')
ax[1, 1].set_xlabel('n_clusters')
ax[1, 1].set_ylabel('Calinski-Harabasz score')
ax[1, 1].set_xticks(n_clusters)
ax[1, 1].grid(visible=True)
plt.tight_layout()
plt.show()
Judging by homogeneity, completeness, and v-measure, the clustering tends to perform better with more clusters (except for completeness, which falls slightly after k=7).
In DBSCAN, eps is the maximum distance between two samples for one to be considered in the neighbourhood of the other; it effectively sets the radius of the dense regions the algorithm grows.
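A common heuristic for choosing eps, which we did not use in the original notebook, is the k-distance plot: sort every point's distance to its k-th nearest neighbour and look for the knee; an eps near that knee separates dense regions from noise. A sketch on synthetic blobs:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X_demo, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.5, random_state=42)

k = 5  # matches DBSCAN's default min_samples
# kneighbors on the training data includes each point itself as its first neighbour.
dists, _ = NearestNeighbors(n_neighbors=k).fit(X_demo).kneighbors(X_demo)
k_dist = np.sort(dists[:, -1])  # sorted distance to the k-th neighbour (self counted)

# A candidate eps: the knee region of the sorted curve (here, the 95th percentile).
print(float(k_dist[int(0.95 * len(k_dist))]))
```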
eps_values = [0.1, 0.3, 0.5, 0.7, 1.0]
homogeneity_scores = []
completeness_scores = []
v_measure_scores = []
ch_scores = []
for eps in eps_values:
    # Register start time
    t_start = time.time()
    dbscan = DBSCAN(eps=eps, n_jobs=-1).fit(X)
    y_pred = dbscan.labels_
    homogeneity_scores.append(homogeneity_score(y, y_pred))
    completeness_scores.append(completeness_score(y, y_pred))
    v_measure_scores.append(v_measure_score(y, y_pred))
    ch_scores.append(calinski_harabasz_score(X, y_pred))
    # Register end time
    t_stop = time.time()
    # Print elapsed time
    print(f"Elapsed time: {t_stop - t_start:.5f} seconds (eps: {eps})")
Elapsed time: 53.72845 seconds (eps: 0.1)
Elapsed time: 71.60856 seconds (eps: 0.3)
Elapsed time: 103.42766 seconds (eps: 0.5)
Elapsed time: 59.01302 seconds (eps: 0.7)
Elapsed time: 62.92873 seconds (eps: 1.0)
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(20, 5))
ax[0, 0].plot(eps_values, homogeneity_scores)
ax[0, 0].set_title('DBSCAN homogeneity for varying eps values')
ax[0, 0].set_xlabel('eps')
ax[0, 0].set_ylabel('Homogeneity score')
ax[0, 0].set_xticks(eps_values)
ax[0, 0].grid(visible=True)
ax[0, 1].plot(eps_values, completeness_scores)
ax[0, 1].set_title('DBSCAN completeness for varying eps values')
ax[0, 1].set_xlabel('eps')
ax[0, 1].set_ylabel('Completeness score')
ax[0, 1].set_xticks(eps_values)
ax[0, 1].grid(visible=True)
ax[1, 0].plot(eps_values, v_measure_scores)
ax[1, 0].set_title('DBSCAN v_measure for varying eps values')
ax[1, 0].set_xlabel('eps')
ax[1, 0].set_ylabel('V_measure score')
ax[1, 0].set_xticks(eps_values)
ax[1, 0].grid(visible=True)
ax[1, 1].plot(eps_values, ch_scores)
ax[1, 1].set_title('DBSCAN Calinski-Harabasz score for varying eps values')
ax[1, 1].set_xlabel('eps')
ax[1, 1].set_ylabel('Calinski-Harabasz score')
ax[1, 1].set_xticks(eps_values)
ax[1, 1].grid(visible=True)
plt.tight_layout()
plt.show()
We can see an upward trend in the homogeneity and Calinski-Harabasz scores as eps increases, while completeness and v-measure worsen with higher eps values.
Hierarchical clustering on the entire dataset is extremely slow, so we took a smaller sample of 10,000 points to work with.
X_small, _, y_small, _ = train_test_split(X, y, train_size=10000, random_state=42)
from collections import Counter
Counter(y_small)
Counter({3: 1305,
5: 1292,
6: 885,
1: 1261,
2: 1306,
4: 1295,
0: 1310,
7: 1346})
X_small.shape
(10000, 42)
from scipy.cluster.hierarchy import linkage, dendrogram
clustering = AgglomerativeClustering(n_clusters=8)
labels = clustering.fit_predict(X_small)
# linkage() expects the raw observations (or a condensed distance matrix),
# not the children_ array; build the merge history from the sample instead.
linkage_matrix = linkage(X_small, method='ward')
Counter(labels)
Counter({1: 2109, 0: 2880, 3: 2793, 6: 1732, 4: 164, 7: 291, 2: 22, 5: 9})
print("Completeness score: ", completeness_score(y_small, labels))
print("v_measure_score: ", v_measure_score(y_small, labels))
print("Homogeneity score: ", homogeneity_score(y_small, labels))
Completeness score:  0.28742295699392867
v_measure_score:  0.2447527911377867
Homogeneity score:  0.21311431863211572
Hierarchical clustering does not work well: all three scores are below 0.30. It also offers little advantage here, since we already know the number of clusters we need (8).
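For completeness, the merge structure can be inspected directly with SciPy on a small sample; `fcluster` then cuts the tree into a chosen number of flat clusters. A sketch on synthetic blobs (not our sample):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=200, centers=8, random_state=42)

# Ward linkage on the raw observations: an (n-1) x 4 merge history.
Z = linkage(X_demo, method="ward")

# Cut the tree so that at most 8 flat clusters remain.
labels = fcluster(Z, t=8, criterion="maxclust")
print(Z.shape, len(np.unique(labels)))
```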
We noticed that unsupervised learning on our dataset performs considerably worse than supervised learning.
We will now train and test some of the classifiers on data labeled as either Benign or Malicious, instead of using the 8 attack classes.
Some of the following code repeats earlier cells; we originally worked in multiple notebooks and kept that structure so that single sections can be run independently.
!pip3 install imblearn
Collecting imblearn
Using cached imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Collecting imbalanced-learn
Downloading imbalanced_learn-0.11.0-py3-none-any.whl (235 kB)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (3.2.0)
Requirement already satisfied: joblib>=1.1.1 in /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (1.2.0)
Requirement already satisfied: scikit-learn>=1.0.2 in /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (1.3.0)
Requirement already satisfied: scipy>=1.5.0 in /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (1.10.0)
Requirement already satisfied: numpy>=1.17.3 in /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (1.25.2)
Installing collected packages: imbalanced-learn, imblearn
Successfully installed imbalanced-learn-0.11.0 imblearn-0.0
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from joblib import dump, load
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
import time
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
f1_score,
confusion_matrix,
classification_report,
ConfusionMatrixDisplay,
precision_recall_curve,
average_precision_score,
roc_curve,
roc_auc_score,
RocCurveDisplay,
auc
)
from sklearn.calibration import CalibrationDisplay
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_validate
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
df_train = pd.read_csv("data/df_train.csv")
df_test = pd.read_csv("data/df_test.csv")
print(df_train.shape)
df_test.shape
(55749, 32)
(16000, 32)
df_train.columns
Index(['flow_duration', 'Header_Length', 'Protocol Type', 'Duration', 'Rate',
'fin_flag_number', 'syn_flag_number', 'rst_flag_number',
'psh_flag_number', 'ack_flag_number', 'ack_count', 'syn_count',
'fin_count', 'urg_count', 'rst_count', 'HTTP', 'HTTPS', 'SSH', 'TCP',
'UDP', 'ICMP', 'Tot sum', 'Min', 'Max', 'AVG', 'Std', 'Tot size', 'IAT',
'Covariance', 'Variance', 'label', 'category_label'],
dtype='object')
X_train = df_train.drop(columns=["label", "category_label"])
X_test = df_test.drop(columns=["label", "category_label"])
y_train = df_train["category_label"]
y_test = df_test["category_label"]
The dataset was already split into train and test sets; the test set holds roughly 20% of the data.
Now we scale the values in the train and test sets (all columns, since every feature is numerical) to improve performance and predictions.
standard_scaler = StandardScaler().fit(X_train)
X_train = standard_scaler.transform(X_train)
X_test = standard_scaler.transform(X_test)
Let's also save the standard scaler for future use.
#dump(standard_scaler, "scalers/standard_scaler.bin", compress=True)
X_train
array([[-0.36801019, -0.35346166, 0.46054157, ..., -0.01018213,
-0.35051081, -1.27017796],
[-0.36888444, -0.38354661, -0.52506693, ..., -0.01058209,
-0.35051081, -1.27017796],
[-0.36888444, -0.38354661, -0.52506693, ..., -0.00643102,
-0.35051081, -1.27017796],
...,
[ 0.36874052, -0.37683062, -0.33690531, ..., 1.26147349,
-0.34579192, 0.99974844],
[-0.11088783, -0.32960377, -0.52506693, ..., 1.2614745 ,
1.57512056, 0.99974844],
[ 0.73144974, -0.33881354, -0.22938438, ..., -1.27568763,
-0.34866664, 0.7727558 ]])
X_test
array([[-0.3594684 , 3.34878602, -0.52506693, ..., 1.26151154,
3.37066307, 0.99974844],
[-0.33873717, -0.38126271, 0.46054157, ..., -1.27568763,
-0.3451429 , 0.7727558 ],
[-0.36784072, -0.38318251, -0.52506693, ..., -1.27568763,
-0.35007724, 0.7727558 ],
...,
[ 0.53284477, -0.34475238, -0.47130647, ..., 1.26163369,
-0.34559026, 0.99974844],
[ 0.09184591, -0.32815799, 0.26341987, ..., -1.27568763,
-0.34286041, 0.7727558 ],
[-0.36888444, -0.38354661, -0.52506693, ..., -0.00644665,
-0.35051081, -1.27017796]])
label2id = {'BenignTraffic': 0, 'Mirai': 1, 'DDoS': 1, 'Spoofing': 1, 'Web': 1, 'Recon': 1, 'BruteForce': 1, 'DoS': 1}
id2label = {0: 'BenignTraffic', 1: 'MaliciousTraffic'}
y_train = [label2id[label] for label in y_train]
y_test = [label2id[label] for label in y_test]
Counter(y_train)
Counter({1: 48731, 0: 7018})
Because the classes are heavily imbalanced, we perform some undersampling.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy="auto", random_state=42)
X_res, y_res = rus.fit_resample(X_train, pd.Series(y_train))
print(X_res.shape, y_res.shape)
print(y_res.value_counts())
ax = y_res.value_counts().plot.pie(autopct='%.2f')
_ = ax.set_title("Under-sampling")
(14036, 30) (14036,)
0    7018
1    7018
dtype: int64
Now our data is balanced: we undersampled the malicious class to match the number of benign samples.
df_test_final = pd.read_csv("data/final_test_data.csv")
df_test_final = df_test_final[df_train.columns]
X_test_final = df_test_final.drop(columns=["label", "category_label"])
y_test_final = df_test_final["category_label"]
X_test_final = standard_scaler.transform(X_test_final)
label2id = {'BenignTraffic': 0, 'Mirai': 1, 'DDoS': 1, 'Spoofing': 1, 'Web': 1, 'Recon': 1, 'BruteForce': 1, 'DoS': 1}
id2label = {0: 'BenignTraffic', 1: 'MaliciousTraffic'}
y_test_final = [label2id[label] for label in y_test_final]
label2id = {'BenignTraffic': 0, 'MaliciousTraffic': 1}
X_test_final
array([[-0.36888444, -0.38354661, -0.52506693, ..., -0.00637693,
-0.35051081, -1.27017796],
[-0.36858791, -0.35326693, 0.46054157, ..., -0.01005686,
-0.35051081, -1.27017796],
[-0.3686571 , -0.36995928, 0.46054157, ..., -0.01019597,
-0.35049261, -1.15668164],
...,
[ 0.46692148, -0.34883984, -0.13082353, ..., 1.26109575,
-0.04195424, 0.99974844],
[-0.36870301, -0.3834317 , -0.417546 , ..., 1.26147301,
-0.34580619, 0.99974844],
[-0.10955779, -0.31405063, -0.48922662, ..., 1.26147385,
-0.3486514 , 0.99974844]])
df_test_final.columns
Index(['flow_duration', 'Header_Length', 'Protocol Type', 'Duration', 'Rate',
'fin_flag_number', 'syn_flag_number', 'rst_flag_number',
'psh_flag_number', 'ack_flag_number', 'ack_count', 'syn_count',
'fin_count', 'urg_count', 'rst_count', 'HTTP', 'HTTPS', 'SSH', 'TCP',
'UDP', 'ICMP', 'Tot sum', 'Min', 'Max', 'AVG', 'Std', 'Tot size', 'IAT',
'Covariance', 'Variance', 'label', 'category_label'],
dtype='object')
Counter(y_test_final)
Counter({1: 63064, 0: 10000})
Let's now try the various classifiers
models = {'logistic': LogisticRegression(random_state=42, max_iter=1000),
'knn': KNeighborsClassifier(n_neighbors=10),
'decision_tree': DecisionTreeClassifier(random_state=42),
'random_forest': RandomForestClassifier(random_state=42),
'mlp': MLPClassifier((10, 10, 5), random_state=42),
'deeper_mlp': MLPClassifier((10, 10, 100, 100, 100, 100, 50, 50, 50, 10), early_stopping=True, random_state=42)}
predictions = {'logistic': None,
'knn': None,
'random_forest': None,
'decision_tree': None,
'mlp': None,
'deeper_mlp': None}
for i, (model_name, model) in enumerate(models.items()):
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    model.fit(X_res, y_res)
    dump(model, f"models/binary/{model_name}.joblib")
    y_pred = model.predict(X_test)
    predictions[model_name] = y_pred
    # Classification Report
    print(f"Classification Report for {model_name}:")
    print(classification_report(y_test, y_pred))
    # Confusion Matrix
    disp_cm = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test, y_pred, normalize='true'),
                                     display_labels=list(id2label.values()))
    disp_cm.plot(cmap='Blues', ax=axes[0])
    axes[0].set_title(f"Confusion Matrix for {model_name}")
    # ROC Curve
    y_prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)
    disp_roc = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc, estimator_name=model_name)
    disp_roc.plot(ax=axes[1])
    axes[1].set_title(f'ROC Curve for {model_name}')
    plt.tight_layout()
    plt.show()
Classification Report for logistic:
precision recall f1-score support
0 0.43 0.85 0.57 2036
1 0.97 0.84 0.90 13964
accuracy 0.84 16000
macro avg 0.70 0.84 0.74 16000
weighted avg 0.91 0.84 0.86 16000
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/threadpoolctl.py:1019: RuntimeWarning: libc not found. The ctypes module in Python 3.9 is maybe too old for this OS. warnings.warn(
Classification Report for knn:
precision recall f1-score support
0 0.39 0.87 0.54 2036
1 0.98 0.80 0.88 13964
accuracy 0.81 16000
macro avg 0.68 0.83 0.71 16000
weighted avg 0.90 0.81 0.84 16000
Classification Report for decision_tree:
precision recall f1-score support
0 0.68 0.91 0.77 2036
1 0.99 0.94 0.96 13964
accuracy 0.93 16000
macro avg 0.83 0.92 0.87 16000
weighted avg 0.95 0.93 0.94 16000
Classification Report for random_forest:
precision recall f1-score support
0 0.73 0.90 0.81 2036
1 0.99 0.95 0.97 13964
accuracy 0.95 16000
macro avg 0.86 0.93 0.89 16000
weighted avg 0.95 0.95 0.95 16000
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/neural_network/_multilayer_perceptron.py:691: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet. warnings.warn(
Classification Report for mlp:
precision recall f1-score support
0 0.54 0.84 0.66 2036
1 0.98 0.90 0.93 13964
accuracy 0.89 16000
macro avg 0.76 0.87 0.80 16000
weighted avg 0.92 0.89 0.90 16000
Classification Report for deeper_mlp:
precision recall f1-score support
0 0.54 0.81 0.65 2036
1 0.97 0.90 0.93 13964
accuracy 0.89 16000
macro avg 0.75 0.85 0.79 16000
weighted avg 0.91 0.89 0.90 16000
for i, (model_name, model) in enumerate(models.items()):
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    model = load(f"models/binary/{model_name}.joblib")
    y_pred = model.predict(X_test_final)
    # Classification Report
    print(f"Classification Report for {model_name}:")
    print(classification_report(y_test_final, y_pred))
    # Confusion Matrix
    disp_cm = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test_final, y_pred, normalize='true'),
                                     display_labels=list(id2label.values()))
    disp_cm.plot(cmap='Blues', ax=axes[0])
    axes[0].set_title(f"Confusion Matrix for {model_name}")
    # ROC Curve
    y_prob = model.predict_proba(X_test_final)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test_final, y_prob)
    roc_auc = auc(fpr, tpr)
    disp_roc = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc, estimator_name=model_name)
    disp_roc.plot(ax=axes[1])
    axes[1].set_title(f'ROC Curve for {model_name}')
    plt.tight_layout()
    plt.show()
Classification Report for logistic:
precision recall f1-score support
0 0.47 0.85 0.61 10000
1 0.97 0.85 0.91 63064
accuracy 0.85 73064
macro avg 0.72 0.85 0.76 73064
weighted avg 0.90 0.85 0.87 73064
Classification Report for knn:
precision recall f1-score support
0 0.43 0.87 0.58 10000
1 0.98 0.82 0.89 63064
accuracy 0.82 73064
macro avg 0.70 0.85 0.73 73064
weighted avg 0.90 0.82 0.85 73064
Classification Report for decision_tree:
precision recall f1-score support
0 0.71 0.90 0.79 10000
1 0.98 0.94 0.96 63064
accuracy 0.94 73064
macro avg 0.85 0.92 0.88 73064
weighted avg 0.95 0.94 0.94 73064
Classification Report for random_forest:
precision recall f1-score support
0 0.77 0.90 0.83 10000
1 0.98 0.96 0.97 63064
accuracy 0.95 73064
macro avg 0.88 0.93 0.90 73064
weighted avg 0.96 0.95 0.95 73064
Classification Report for mlp:
precision recall f1-score support
0 0.58 0.84 0.68 10000
1 0.97 0.90 0.94 63064
accuracy 0.89 73064
macro avg 0.77 0.87 0.81 73064
weighted avg 0.92 0.89 0.90 73064
Classification Report for deeper_mlp:
precision recall f1-score support
0 0.57 0.82 0.67 10000
1 0.97 0.90 0.93 63064
accuracy 0.89 73064
macro avg 0.77 0.86 0.80 73064
weighted avg 0.91 0.89 0.90 73064
We noticed that the models classify malicious traffic better than benign traffic, which is arguably preferable in our context: it is more important to detect attacks. Above we see that the decision tree and the random forest perform best.
Note that y_test_final is highly imbalanced, yet the binary classifiers perform quite well even without much further analysis. We also tried building a classifier on the imbalanced X_train, and it performed noticeably worse on BenignTraffic, the under-represented class.
We will now look at hard voting and soft voting. Hard voting lets each model cast a vote and selects the class with the majority of votes; soft voting instead averages the predicted class probabilities of each model.
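A minimal illustration of the difference on toy data (synthetic, not our IoT features; the estimator mix is arbitrary): hard voting takes the majority of predicted labels, while soft voting averages the predicted probabilities before taking the argmax, which is also why only the soft variant exposes `predict_proba`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, random_state=42)
estimators = [("lr", LogisticRegression(max_iter=1000)),
              ("nb", GaussianNB()),
              ("dt", DecisionTreeClassifier(random_state=42))]

hard = VotingClassifier(estimators, voting="hard").fit(X_demo, y_demo)
soft = VotingClassifier(estimators, voting="soft").fit(X_demo, y_demo)

print(hard.predict(X_demo[:5]))        # majority label vote
print(soft.predict_proba(X_demo[:5]))  # averaged probabilities (soft only)
```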
models = {'logistic': LogisticRegression(random_state=42, max_iter=1000),
'knn': KNeighborsClassifier(n_neighbors=10),
'decision_tree': DecisionTreeClassifier(random_state=42),
'random_forest': RandomForestClassifier(random_state=42),
'mlp': MLPClassifier((10, 10, 5), random_state=42)}
clf_hard = VotingClassifier(estimators=[(name, model) for name, model in models.items()], voting='hard', verbose=True, n_jobs=-1)
clf_soft = VotingClassifier(estimators=[(name, model) for name, model in models.items()], voting='soft', verbose=True, n_jobs=-1)
clf_hard.fit(X_res, y_res)
[Voting] ...................... (2 of 5) Processing knn, total=   0.0s
[Voting] ................. (1 of 5) Processing logistic, total=   0.2s
[Voting] ............ (3 of 5) Processing decision_tree, total=   0.2s
[Voting] ............ (4 of 5) Processing random_forest, total=   2.9s
[Voting] ...................... (5 of 5) Processing mlp, total=   7.7s
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/neural_network/_multilayer_perceptron.py:691: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet. warnings.warn(
VotingClassifier(estimators=[('logistic',
                              LogisticRegression(max_iter=1000,
                                                 random_state=42)),
                             ('knn', KNeighborsClassifier(n_neighbors=10)),
                             ('decision_tree',
                              DecisionTreeClassifier(random_state=42)),
                             ('random_forest',
                              RandomForestClassifier(random_state=42)),
                             ('mlp',
                              MLPClassifier(hidden_layer_sizes=(10, 10, 5),
                                            random_state=42))],
                 n_jobs=-1, verbose=True)
dump(clf_hard, f"models/binary/voting_hard.joblib")
['models/binary/voting_hard.joblib']
fig, ax = plt.subplots(figsize=(7, 5))
y_pred = clf_hard.predict(X_test)
predictions['voting_hard'] = y_pred
# Classification Report
print("Classification Report for Hard Voting:")
print(classification_report(y_test, y_pred))
# Confusion Matrix (hard voting exposes no predict_proba, so no ROC curve here)
disp_cm = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test, y_pred, normalize='true'),
                                 display_labels=list(id2label.values()))
disp_cm.plot(cmap='Blues', ax=ax)
ax.set_title("Confusion Matrix for Hard Voting")
plt.tight_layout()
plt.show()
Classification Report for Hard Voting:
precision recall f1-score support
0 0.57 0.87 0.69 2036
1 0.98 0.90 0.94 13964
accuracy 0.90 16000
macro avg 0.77 0.89 0.81 16000
weighted avg 0.93 0.90 0.91 16000
fig, ax = plt.subplots(figsize=(7, 5))
y_pred = clf_hard.predict(X_test_final)
# Classification Report
print("Classification Report for Voting Hard:")
print(classification_report(y_test_final, y_pred))
# Confusion Matrix (hard voting exposes no predict_proba, so no ROC curve here)
disp_cm = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test_final, y_pred, normalize='true'),
                                 display_labels=list(id2label.values()))
disp_cm.plot(cmap='Blues', ax=ax)
ax.set_title("Confusion Matrix for Voting Hard")
plt.tight_layout()
plt.show()
Classification Report for Voting Hard:
precision recall f1-score support
0 0.61 0.88 0.72 10000
1 0.98 0.91 0.94 63064
accuracy 0.90 73064
macro avg 0.79 0.89 0.83 73064
weighted avg 0.93 0.90 0.91 73064
clf_soft.fit(X_res, y_res)
[Voting] ...................... (2 of 5) Processing knn, total=   0.0s
[Voting] ............ (3 of 5) Processing decision_tree, total=   0.3s
[Voting] ................. (1 of 5) Processing logistic, total=   0.2s
[Voting] ............ (4 of 5) Processing random_forest, total=   3.2s
[Voting] ...................... (5 of 5) Processing mlp, total=   7.6s
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/neural_network/_multilayer_perceptron.py:691: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet. warnings.warn(
VotingClassifier(estimators=[('logistic',
                              LogisticRegression(max_iter=1000,
                                                 random_state=42)),
                             ('knn', KNeighborsClassifier(n_neighbors=10)),
                             ('decision_tree',
                              DecisionTreeClassifier(random_state=42)),
                             ('random_forest',
                              RandomForestClassifier(random_state=42)),
                             ('mlp',
                              MLPClassifier(hidden_layer_sizes=(10, 10, 5),
                                            random_state=42))],
                 n_jobs=-1, verbose=True, voting='soft')
dump(clf_soft, "models/binary/voting_soft.joblib")
['models/binary/voting_soft.joblib']
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
y_pred = clf_soft.predict(X_test)
predictions['voting_soft'] = y_pred
# Classification Report
print("Classification Report for Soft Voting:")
print(classification_report(y_test, y_pred))
# Confusion Matrix
disp_cm = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test, y_pred, normalize='true'),
                                 display_labels=list(id2label.values()))
disp_cm.plot(cmap='Blues', ax=axes[0])
axes[0].set_title("Confusion Matrix for Soft Voting")
# ROC Curve (soft voting averages probabilities, so predict_proba is available)
y_prob = clf_soft.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
disp_roc = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc, estimator_name='voting_soft')
disp_roc.plot(ax=axes[1])
axes[1].set_title('ROC Curve for Soft Voting')
plt.tight_layout()
plt.show()
Classification Report for Soft Voting:
precision recall f1-score support
0 0.62 0.87 0.73 2036
1 0.98 0.92 0.95 13964
accuracy 0.92 16000
macro avg 0.80 0.90 0.84 16000
weighted avg 0.93 0.92 0.92 16000
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
y_pred = clf_soft.predict(X_test_final)
# Classification Report
print("Classification Report for Voting Soft:")
print(classification_report(y_test_final, y_pred))
# Confusion Matrix
disp_cm = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test_final, y_pred, normalize='true'),
                                 display_labels=list(id2label.values()))
disp_cm.plot(cmap='Blues', ax=axes[0])
axes[0].set_title("Confusion Matrix for Voting Soft")
# ROC Curve (soft voting averages probabilities, so predict_proba is available)
y_prob = clf_soft.predict_proba(X_test_final)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test_final, y_prob)
roc_auc = auc(fpr, tpr)
disp_roc = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc, estimator_name='voting_soft')
disp_roc.plot(ax=axes[1])
axes[1].set_title('ROC Curve for Voting Soft')
plt.tight_layout()
plt.show()
Classification Report for Voting Soft:
precision recall f1-score support
0 0.66 0.87 0.75 10000
1 0.98 0.93 0.95 63064
accuracy 0.92 73064
macro avg 0.82 0.90 0.85 73064
weighted avg 0.94 0.92 0.93 73064
Hard and soft voting perform almost the same, with soft voting slightly ahead (0.92 vs. 0.90 accuracy).
Let's now try some anomaly detection techniques.
We also tried resampling so that the malicious are much less than the benign.
After performing the analysis both with scaled and not scaled data, we saw that the acuracy was usually a little better with scaled data.
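The effect of scaling can be illustrated on synthetic data (an assumed stand-in for our features, with one column blown up to mimic a raw, large-magnitude traffic column such as header length):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; exaggerate one feature's scale
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X[:, 0] *= 1e4
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Same model with and without standardization
raw = LogisticRegression(max_iter=200).fit(X_tr, y_tr).score(X_te, y_te)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200)).fit(X_tr, y_tr).score(X_te, y_te)
print(f"raw: {raw:.3f}  scaled: {scaled:.3f}")
```

Gradient-based and distance-based models are sensitive to feature scale, which is consistent with the small accuracy gain we observed after scaling.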
# Imports used in this section (repeated here in case earlier cells were not run)
import time
from sklearn.preprocessing import StandardScaler
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import accuracy_score
from imblearn.under_sampling import RandomUnderSampler

df = pd.read_csv('data/full_balanced_data.csv')
X = df.drop(['category_label', 'label'], axis=1).copy()
y = df['category_label'].copy()
standard_scaler = StandardScaler()
X = standard_scaler.fit_transform(X)
label2id = {'BenignTraffic': 1, 'Mirai': -1, 'DDoS': -1, 'Spoofing': -1, 'Web': -1, 'Recon': -1, 'BruteForce': -1, 'DoS': -1}
id2label = {1: 'BenignTraffic', -1: 'MaliciousTraffic'}
y = [label2id[label] for label in y]
rus = RandomUnderSampler(sampling_strategy="auto", random_state=42)
X_res, y_res = rus.fit_resample(X, pd.Series(y))
print('X_res, y_res')
print(X_res.shape, y_res.shape)
print(y_res.value_counts())
ax = y_res.value_counts().plot.pie(autopct='%.2f')
_ = ax.set_title("Under-sampling")
plt.show()
rus = RandomUnderSampler(sampling_strategy={-1: 5000})
X_res_small, y_res_small = rus.fit_resample(X, pd.Series(y))
print('X_res_small, y_res_small')
print(X_res_small.shape, y_res_small.shape)
print(y_res_small.value_counts())
ax = y_res_small.value_counts().plot.pie(autopct='%.2f')
_ = ax.set_title("Under-sampling")
plt.show()
label2id = {'BenignTraffic': 1, 'MaliciousTraffic': -1}
X_res, y_res
(40000, 46) (40000,)
-1    20000
 1    20000
dtype: int64
X_res_small, y_res_small
(25000, 46) (25000,)
 1    20000
-1     5000
dtype: int64
outlier_fraction = 0.01
Fit the model on the full data set
t_start = time.time()
y_pred = EllipticEnvelope(contamination=outlier_fraction).fit_predict(X)
t_stop = time.time()
print(f"Fitting time: {t_stop - t_start:.2f} s")
np.unique(y_pred, return_counts=True)
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/covariance/_robust_covariance.py:747: UserWarning: The covariance matrix associated to your dataset is not full rank
  warnings.warn(
Fitting time: 28.59 s
(array([-1, 1]), array([ 1531, 151533]))
print('Accuracy on full dataset')
accuracy_score(y, y_pred)
Accuracy on full dataset
0.13052709977525742
The accuracy score is extremely low; as expected, anomaly detection on the entire dataset does not work well, because the malicious traffic we want to flag is the majority class rather than a rare outlier.
t_start = time.time()
y_pred = EllipticEnvelope(contamination=outlier_fraction).fit_predict(X_res)
t_stop = time.time()
print(f"Fitting time: {t_stop - t_start:.2f} s")
np.unique(y_pred, return_counts=True)
Fitting time: 7.68 s
(array([-1, 1]), array([ 400, 39600]))
Looking at the smaller dataset, the accuracy score improves slightly. This is because more of the benign values are captured, which automatically raises the accuracy score. Very few anomalies are found, however, possibly because our dataset does not contain "anomalies" in the sense these techniques assume.
print('Accuracy on smaller dataset')
accuracy_score(y_res, y_pred)
Accuracy on smaller dataset
0.5074
t_start = time.time()
y_pred = EllipticEnvelope(contamination=outlier_fraction).fit_predict(X_res_small)
t_stop = time.time()
print(f"Fitting time: {t_stop - t_start:.2f} s")
np.unique(y_pred, return_counts=True)
print('Accuracy on even smaller dataset')
accuracy_score(y_res_small, y_pred)
Fitting time: 5.12 s
Accuracy on even smaller dataset
0.79456
Because EllipticEnvelope assumes the bulk of the data is normally distributed and flags points far from that distribution as anomalies, and in our dataset the majority of the data is "an anomaly", the method does not work well.
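To illustrate the point, EllipticEnvelope does behave well when the inliers really are Gaussian and the anomalies are a rare, distant minority. The synthetic data below is our own construction, purely for illustration:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(42)
inliers = rng.normal(0.0, 1.0, size=(980, 2))   # Gaussian bulk (the "normal" traffic)
outliers = rng.uniform(6.0, 8.0, size=(20, 2))  # 2% of points, far from the bulk
X_demo = np.vstack([inliers, outliers])

# contamination tells the model what fraction of points to flag as -1
pred = EllipticEnvelope(contamination=0.02, random_state=42).fit_predict(X_demo)
print((pred[-20:] == -1).mean())  # fraction of the injected outliers that were caught
```

In our IoT data the "outliers" (attacks) are the majority class, which violates exactly this assumption.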
t_start = time.time()
y_pred_svm = OneClassSVM(nu=outlier_fraction, kernel="rbf", gamma=0.1).fit(X).predict(X)
t_stop = time.time()
print(f"Fitting time: {t_stop - t_start:.2f} s")
print(np.unique(y_pred_svm, return_counts=True))
accuracy_score(y, y_pred_svm)
Fitting time: 215.15 s
(array([-1,  1]), array([  2233, 150831]))
0.14040532065018554
t_start = time.time()
y_pred_svm = OneClassSVM(nu=outlier_fraction, kernel="rbf", gamma=0.1).fit(X_res).predict(X_res)
t_stop = time.time()
print(f"Fitting time: {t_stop - t_start:.2f} s")
print(np.unique(y_pred_svm, return_counts=True))
accuracy_score(y_res, y_pred_svm)
Fitting time: 22.94 s
(array([-1,  1]), array([ 1385, 38615]))
0.513725
t_start = time.time()
y_pred_svm = OneClassSVM(nu=outlier_fraction, kernel="rbf", gamma=0.1).fit(X_res_small).predict(X_res_small)
t_stop = time.time()
print(f"Fitting time: {t_stop - t_start:.2f} s")
print(np.unique(y_pred_svm, return_counts=True))
accuracy_score(y_res_small, y_pred_svm)
Fitting time: 8.08 s
(array([-1,  1]), array([  696, 24304]))
0.79064
t_start = time.time()
y_pred_iso = IsolationForest(contamination=outlier_fraction, random_state=42).fit(X).predict(X)
t_stop = time.time()
print(f"Fitting time: {t_stop - t_start:.2f} s")
print(np.unique(y_pred_iso, return_counts=True))
accuracy_score(y, y_pred_iso)
Fitting time: 1.71 s
(array([-1,  1]), array([  1531, 151533]))
0.13664218888830815
t_start = time.time()
y_pred_iso = IsolationForest(contamination=outlier_fraction, random_state=42).fit(X_res).predict(X_res)
t_stop = time.time()
print(f"Fitting time: {t_stop - t_start:.2f} s")
print(np.unique(y_pred_iso, return_counts=True))
accuracy_score(y_res, y_pred_iso)
Fitting time: 0.78 s
(array([-1,  1]), array([  400, 39600]))
0.49765
t_start = time.time()
y_pred_iso = IsolationForest(contamination=outlier_fraction, random_state=42).fit(X_res_small).predict(X_res_small)
t_stop = time.time()
print(f"Fitting time: {t_stop - t_start:.2f} s")
print(np.unique(y_pred_iso, return_counts=True))
accuracy_score(y_res_small, y_pred_iso)
Fitting time: 0.47 s
(array([-1,  1]), array([  250, 24750]))
0.79408
t_start = time.time()
y_pred_lof = LocalOutlierFactor(n_neighbors=35, contamination=outlier_fraction).fit_predict(X)
t_stop = time.time()
print(f"Fitting time: {t_stop - t_start:.2f} s")
print(np.unique(y_pred_lof, return_counts=True))
accuracy_score(y, y_pred_lof)
Fitting time: 55.56 s
(array([-1,  1]), array([  1531, 151533]))
0.14053598494747296
t_start = time.time()
y_pred_lof = LocalOutlierFactor(n_neighbors=35, contamination=outlier_fraction).fit_predict(X_res)
t_stop = time.time()
print(f"Fitting time: {t_stop - t_start:.2f} s")
print(np.unique(y_pred_lof, return_counts=True))
accuracy_score(y_res, y_pred_lof)
Fitting time: 4.89 s
(array([-1,  1]), array([  400, 39600]))
0.51
t_start = time.time()
y_pred_lof = LocalOutlierFactor(n_neighbors=35, contamination=outlier_fraction).fit_predict(X_res_small)
t_stop = time.time()
print(f"Fitting time: {t_stop - t_start:.2f} s")
print(np.unique(y_pred_lof, return_counts=True))
accuracy_score(y_res_small, y_pred_lof)
Fitting time: 2.00 s
(array([-1,  1]), array([  250, 24750]))
0.81
We can see that anomaly detection does not work well here, even though the accuracy score can sometimes be quite high (81%). Despite the high accuracy, the models fail to find most anomalies (for the dataset with very few anomalies, only about 250 of 5000 are flagged). The accuracy is high simply because the majority of the data is benign and is correctly identified as benign.
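A quick worked example makes this concrete; the labels and predictions below are hypothetical, constructed to mirror the counts from the LOF run on the small resampled set (20000 benign, 5000 malicious, 250 points flagged):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels/predictions mirroring the 250/5000 situation above
y_true = np.array([1] * 20000 + [-1] * 5000)   # 1 = benign, -1 = malicious
y_hat = np.array([1] * 24750 + [-1] * 250)     # only 250 points flagged as anomalies

print(accuracy_score(y_true, y_hat))              # high, driven by the benign majority
print(recall_score(y_true, y_hat, pos_label=-1))  # fraction of anomalies actually found
```

Accuracy comes out at 0.81 while recall on the malicious class is only 0.05, so recall (or macro F1) on the anomaly class is the more honest metric here.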
For this specific dataset, the authors of a paper (available at https://www.mdpi.com/1424-8220/23/13/5941#B99-sensors-23-05941) obtained much better results than we did. Reading the paper, we do not think they balanced the dataset in any way, which would explain the better headline numbers. To test this hypothesis, we will build some simple classifiers (similar to those used in the paper) on the 5% dataset (whose labels are unevenly distributed) and see if this is the case.
We will also use the standard scaler, as the paper mentions that is what they used.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from joblib import dump, load
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
import time
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
f1_score,
confusion_matrix,
classification_report,
ConfusionMatrixDisplay,
precision_recall_curve,
average_precision_score,
roc_curve,
roc_auc_score
)
from sklearn.calibration import CalibrationDisplay
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
df = pd.read_csv('data/IoT_5Percent.csv')
# Map the 34 fine-grained labels onto 8 categories
dict_8classes = {
    'DDoS-RSTFINFlood': 'DDoS',
    'DDoS-PSHACK_Flood': 'DDoS',
    'DDoS-SYN_Flood': 'DDoS',
    'DDoS-UDP_Flood': 'DDoS',
    'DDoS-TCP_Flood': 'DDoS',
    'DDoS-ICMP_Flood': 'DDoS',
    'DDoS-SynonymousIP_Flood': 'DDoS',
    'DDoS-ACK_Fragmentation': 'DDoS',
    'DDoS-UDP_Fragmentation': 'DDoS',
    'DDoS-ICMP_Fragmentation': 'DDoS',
    'DDoS-SlowLoris': 'DDoS',
    'DDoS-HTTP_Flood': 'DDoS',
    'DoS-UDP_Flood': 'DoS',
    'DoS-SYN_Flood': 'DoS',
    'DoS-TCP_Flood': 'DoS',
    'DoS-HTTP_Flood': 'DoS',
    'Mirai-greeth_flood': 'Mirai',
    'Mirai-greip_flood': 'Mirai',
    'Mirai-udpplain': 'Mirai',
    'Recon-PingSweep': 'Recon',
    'Recon-OSScan': 'Recon',
    'Recon-PortScan': 'Recon',
    'VulnerabilityScan': 'Recon',
    'Recon-HostDiscovery': 'Recon',
    'DNS_Spoofing': 'Spoofing',
    'MITM-ArpSpoofing': 'Spoofing',
    'BrowserHijacking': 'Web',
    'Backdoor_Malware': 'Web',
    'XSS': 'Web',
    'Uploading_Attack': 'Web',
    'SqlInjection': 'Web',
    'CommandInjection': 'Web',
    'DictionaryBruteForce': 'BruteForce',
    'BenignTraffic': 'BenignTraffic',
}
category_labels = [dict_8classes[label] for label in df['label']]
df["category_label"] = category_labels
df.category_label.value_counts().plot(kind='bar')
<Axes: xlabel='category_label'>
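As an aside, an equivalent and more concise way to apply such a label mapping is pandas' `Series.map`; the tiny frame below is illustrative, using only a subset of `dict_8classes`:

```python
import pandas as pd

# Toy mapping and frame (illustrative subset of dict_8classes)
mapping = {'DDoS-UDP_Flood': 'DDoS', 'DoS-SYN_Flood': 'DoS', 'BenignTraffic': 'BenignTraffic'}
toy = pd.DataFrame({'label': ['DDoS-UDP_Flood', 'BenignTraffic', 'DoS-SYN_Flood']})
toy['category_label'] = toy['label'].map(mapping)
print(toy['category_label'].tolist())  # ['DDoS', 'BenignTraffic', 'DoS']
```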
X_train, X_test, y_train, y_test = train_test_split(df.drop(['label', 'category_label'], axis=1), df['category_label'], test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train
array([[-1.94929421e-02, -1.66170783e-01, -3.42858899e-01, ...,
-9.95041577e-02, -4.13856083e-01, 2.10318842e-03],
[-1.94929421e-02, -1.66170783e-01, -3.42858899e-01, ...,
-9.95041577e-02, -4.13856083e-01, 2.10318842e-03],
[-1.94929421e-02, -1.66287304e-01, 4.13247593e+00, ...,
-9.64875574e-02, -2.41900322e-01, 2.10318842e-03],
...,
[-1.94929421e-02, -1.66171948e-01, -3.49561847e-01, ...,
-9.95034195e-02, 1.60333186e-02, 2.10318842e-03],
[ 5.03788482e-01, -1.65594215e-01, 2.58032316e-02, ...,
-7.05213438e-02, 3.88503793e+00, 4.90838713e+00],
[-1.93909497e-02, -1.10958849e-01, 8.68140342e-01, ...,
-9.94934764e-02, 1.02011199e-01, 2.10318842e-03]])
X_test
array([[-0.00857309, -0.16582911, -0.31828142, ..., -0.09948054,
0.18798908, 0.00210319],
[-0.01949294, -0.16617078, -0.3428589 , ..., -0.09950416,
-0.41385608, 0.00210319],
[-0.01930461, -0.12696981, 0.88601487, ..., -0.09950416,
-0.41385608, 0.00210319],
...,
[-0.01949294, -0.16617078, -0.3428589 , ..., -0.09950416,
-0.41385608, 0.00210319],
[-0.01926746, -0.09869385, 0.88601487, ..., -0.09950416,
-0.41385608, 0.00210319],
[-0.01949294, -0.16617078, -0.3428589 , ..., -0.09950416,
-0.41385608, 0.00210319]])
models = {
'logistic': LogisticRegression(random_state=42, max_iter=500, n_jobs=-1),
'RandomForestClassifier': RandomForestClassifier(random_state=42, n_jobs=-1),
'MLP': MLPClassifier(hidden_layer_sizes=(50, 50, 10), random_state=42, early_stopping=True, verbose=True)
}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Classification Report
    print(f"Classification Report for {model_name}:")
    print(classification_report(y_test, y_pred))
    # Confusion Matrix, labeled with the classes the model was trained on
    disp_cm = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test, y_pred, normalize='true'), display_labels=model.classes_)
    # Plot the confusion matrix
    disp_cm.plot(cmap='Blues')
    plt.title(f'Confusion Matrix - {model_name}')
    plt.show()
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
Classification Report for logistic:
precision recall f1-score support
BenignTraffic 0.71 0.90 0.80 10789
BruteForce 1.00 0.15 0.26 129
DDoS 0.84 0.99 0.91 339357
DoS 0.85 0.21 0.33 80858
Mirai 0.99 0.99 0.99 26451
Recon 0.67 0.42 0.52 3631
Spoofing 0.76 0.47 0.58 4986
Web 1.00 0.01 0.03 229
accuracy 0.84 466430
macro avg 0.85 0.52 0.55 466430
weighted avg 0.84 0.84 0.80 466430
Classification Report for RandomForestClassifier:
precision recall f1-score support
BenignTraffic 0.91 0.98 0.94 10789
BruteForce 1.00 0.28 0.44 129
DDoS 1.00 1.00 1.00 339357
DoS 1.00 1.00 1.00 80858
Mirai 1.00 1.00 1.00 26451
Recon 0.89 0.81 0.84 3631
Spoofing 0.89 0.84 0.87 4986
Web 0.96 0.20 0.33 229
accuracy 1.00 466430
macro avg 0.96 0.76 0.80 466430
weighted avg 1.00 1.00 1.00 466430
Iteration 1, loss = 0.27689271
Validation score: 0.959565
Iteration 2, loss = 0.06017070
Validation score: 0.986633
Iteration 3, loss = 0.04149161
Validation score: 0.987072
Iteration 4, loss = 0.03698626
Validation score: 0.988085
Iteration 5, loss = 0.03538981
Validation score: 0.988734
Iteration 6, loss = 0.03337441
Validation score: 0.989216
Iteration 7, loss = 0.03241351
Validation score: 0.989682
Iteration 8, loss = 0.03133883
Validation score: 0.989591
Iteration 9, loss = 0.03033049
Validation score: 0.989666
Iteration 10, loss = 0.02930050
Validation score: 0.989650
Iteration 11, loss = 0.02868498
Validation score: 0.990074
Iteration 12, loss = 0.02803804
Validation score: 0.989264
Iteration 13, loss = 0.02761532
Validation score: 0.989720
Iteration 14, loss = 0.02739238
Validation score: 0.990534
Iteration 15, loss = 0.02706570
Validation score: 0.990047
Iteration 16, loss = 0.02661239
Validation score: 0.990610
Iteration 17, loss = 0.02671318
Validation score: 0.990829
Iteration 18, loss = 0.02661693
Validation score: 0.990626
Iteration 19, loss = 0.02625148
Validation score: 0.990213
Iteration 20, loss = 0.02591241
Validation score: 0.990492
Iteration 21, loss = 0.02569819
Validation score: 0.990786
Iteration 22, loss = 0.02554373
Validation score: 0.991006
Iteration 23, loss = 0.02537310
Validation score: 0.990685
Iteration 24, loss = 0.02527398
Validation score: 0.990963
Iteration 25, loss = 0.02509468
Validation score: 0.990894
Iteration 26, loss = 0.02506151
Validation score: 0.990556
Iteration 27, loss = 0.02504354
Validation score: 0.990920
Iteration 28, loss = 0.02477875
Validation score: 0.991371
Iteration 29, loss = 0.02475377
Validation score: 0.991430
Iteration 30, loss = 0.02462744
Validation score: 0.990845
Iteration 31, loss = 0.02439641
Validation score: 0.991296
Iteration 32, loss = 0.02446256
Validation score: 0.991044
Iteration 33, loss = 0.02422673
Validation score: 0.990995
Iteration 34, loss = 0.02419719
Validation score: 0.990808
Iteration 35, loss = 0.02412869
Validation score: 0.990931
Iteration 36, loss = 0.02413487
Validation score: 0.991387
Iteration 37, loss = 0.02414395
Validation score: 0.991296
Iteration 38, loss = 0.02395756
Validation score: 0.991028
Iteration 39, loss = 0.02393733
Validation score: 0.991547
Iteration 40, loss = 0.02399154
Validation score: 0.990786
Iteration 41, loss = 0.02395461
Validation score: 0.990985
Iteration 42, loss = 0.02374384
Validation score: 0.991376
Iteration 43, loss = 0.02383276
Validation score: 0.991537
Iteration 44, loss = 0.02383290
Validation score: 0.991124
Iteration 45, loss = 0.02356424
Validation score: 0.991676
Iteration 46, loss = 0.02373464
Validation score: 0.991237
Iteration 47, loss = 0.02363103
Validation score: 0.991360
Iteration 48, loss = 0.02359784
Validation score: 0.991714
Iteration 49, loss = 0.02364423
Validation score: 0.991526
Iteration 50, loss = 0.02354171
Validation score: 0.991440
Iteration 51, loss = 0.02353783
Validation score: 0.990749
Iteration 52, loss = 0.02349443
Validation score: 0.991360
Iteration 53, loss = 0.02349664
Validation score: 0.991371
Iteration 54, loss = 0.02346713
Validation score: 0.991639
Iteration 55, loss = 0.02326346
Validation score: 0.991322
Iteration 56, loss = 0.02338681
Validation score: 0.991467
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Classification Report for MLP:
precision recall f1-score support
BenignTraffic 0.83 0.94 0.88 10789
BruteForce 1.00 0.15 0.26 129
DDoS 1.00 1.00 1.00 339357
DoS 1.00 1.00 1.00 80858
Mirai 1.00 1.00 1.00 26451
Recon 0.80 0.62 0.70 3631
Spoofing 0.77 0.72 0.74 4986
Web 1.00 0.03 0.06 229
accuracy 0.99 466430
macro avg 0.92 0.68 0.70 466430
weighted avg 0.99 0.99 0.99 466430
We can see that the metrics of these models are indeed similar to those obtained in the paper (8 classes):
Our Accuracy:
Paper's Accuracy:
Our F1:
Paper's F1:
Even though these models achieve much better accuracy than the ones we tested before, we believe it is wrong to use the unbalanced data. The accuracy score merely reflects the model's ability to correctly classify the majority classes, while ignoring how well it handles the under-represented ones. The most represented classes are classified very well, but the others are badly misclassified (as can be seen from the confusion matrices), and we believe that when testing on a large dataset the models would tend to misclassify most of them.
This is even more apparent in the macro F1 scores: about 80% here (unbalanced data) versus almost 88% in the balanced case, even though accuracy was higher for the unbalanced data.
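A toy example (the class sizes are our own, chosen for illustration) shows how accuracy and macro F1 can diverge on imbalanced labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 95 samples of the majority class, 5 of the minority;
# a degenerate model always predicts the majority class.
y_true = np.array(['A'] * 95 + ['B'] * 5)
y_hat = np.array(['A'] * 100)

print(accuracy_score(y_true, y_hat))                               # 0.95
print(f1_score(y_true, y_hat, average='macro', zero_division=0))   # ~0.49
```

Macro F1 averages the per-class F1 scores with equal weight, so the ignored minority class drags the score down even while accuracy stays high, which is why we report it alongside accuracy.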